An explicit algorithm for normal forms in small overlap monoids

We describe a practical algorithm for computing normal forms for semigroups and monoids with finite presentations satisfying so-called small overlap conditions. Small overlap conditions are natural conditions on the relations in a presentation, which were introduced by J. H. Remmers and subsequently studied extensively by M. Kambites. Presentations satisfying these conditions are ubiquitous; Kambites showed that a randomly chosen finite presentation satisfies the $C(4)$ condition with probability tending to 1 as the sum of the lengths of relation words tends to infinity. Kambites also showed that several key problems for finitely presented semigroups and monoids are tractable in $C(4)$ monoids: the word problem is solvable in $O(\min\{|u|, |v|\})$ time in the size of the input words $u$ and $v$; the uniform word problem for $\langle A|R\rangle$ is solvable in $O(N^2 \min\{|u|, |v|\})$ time where $N$ is the sum of the lengths of the words in $R$; and a normal form for any given word $u$ can be found in $O(|u|)$ time. Although Kambites' algorithm for solving the word problem in $C(4)$ monoids is highly practical, it appears that the coefficients in the linear time algorithm for computing normal forms are too large in practice. In this paper, we present an algorithm for computing normal forms in $C(4)$ monoids that has time complexity $O(|u|^2)$ for input word $u$, but where the coefficients are sufficiently small to allow for practical computation. Additionally, we show that the uniform word problem for small overlap monoids can be solved in $O(N \min\{|u|, |v|\})$ time.

by arbitrarily applying relations in the rewriting system until no further relations apply. This permits the word problem to be solved, via the computation of normal forms, in monoids for which the Knuth-Bendix algorithm terminates.
In this paper we are concerned with a class of finitely presented monoids, introduced by Remmers [Rem71], and studied further by Kambites [Kam09a], [Kam09b], and [Kam11b].
If P = A | R is a monoid presentation, then we will refer to the left or right hand side of any pair (u, v) ∈ R as a relation word. A word w ∈ A* is said to be a piece of P if w is a factor of at least two distinct relation words, or if w occurs more than once (possibly with overlap) as a factor of a single relation word. Note that if a relation word u appears as one side of more than one relation in the presentation, then u is not on that account a piece. A monoid presentation P is said to satisfy the condition C(n), n ∈ N, if the minimum number of pieces in any factorisation of a relation word into pieces is at least n. If no relation word in P equals the empty word, then P satisfies C(1). If no relation word can be written as a product of pieces, then we say that P satisfies C(n) for all n ∈ N. If a presentation satisfies C(n), then it also satisfies C(k) for every k ∈ N with 1 ≤ k < n.
For example, the presentation P = a, b, c | abc = cba satisfies C(3): the set of pieces is P = {ε, a, b, c} and each relation word can be written as a product of exactly 3 pieces, so P does not satisfy C(4). Similarly, for P = a, b, c | acba = a^2bc the set of pieces is {ε, a, b, c} and P satisfies C(4) but not C(5). If P = a, b, c, d | acba = a^2bc, acba = db^3d, then the set of pieces is {ε, a, b, c, d, b^2} and P satisfies C(4) but not C(5), since the relation words acba and a^2bc can each be written as a product of 4 pieces. For the presentation P = a, b, c, d | a^2bc = a^2bd the set of pieces is {ε, a, b, ab, a^2, a^2b} and no relation word can be written as a product of pieces, since neither c nor d is a piece.
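These computations follow directly from the definition of a piece. The following Python sketch (our own naive helper, quadratic in the total length of the relation words, in contrast to the linear-time suffix tree approach described later in the paper) computes the set of pieces of a presentation from its relation words:

```python
def pieces(relation_words):
    # Work with the distinct relation words only: a relation word that
    # appears as one side of several relations is not thereby a piece.
    words = list(dict.fromkeys(relation_words))
    result = {""}  # by convention the empty word is always a piece
    for i, w in enumerate(words):
        for a in range(len(w)):
            for b in range(a + 1, len(w) + 1):
                f = w[a:b]
                # f is a piece if it is a factor of two distinct relation
                # words, or occurs twice (possibly overlapping) in w itself
                if any(f in v for j, v in enumerate(words) if j != i):
                    result.add(f)
                elif w.find(f) != w.rfind(f):
                    result.add(f)
    return result

print(sorted(pieces(["abc", "cba"])))             # -> ['', 'a', 'b', 'c']
print(sorted(pieces(["acba", "aabc", "dbbbd"])))  # -> ['', 'a', 'b', 'bb', 'c', 'd']
```

The two calls reproduce the piece sets of the first and third example presentations above.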
If a finite monoid presentation satisfies the condition C(4), then we will refer to the monoid defined by the presentation as a small overlap monoid. Remmers initiated the study of C(3) monoids in the paper [Rem71]; see also [Hig92, Chapter 5]. If a monoid presentation A | R satisfies C(3), then the number of words in any class of R# is finite, and so the monoid defined by the presentation is infinite; see [Hig92, Corollary 5.2.16]. The word problem is solvable in C(3) monoids, but the algorithm described in [Hig92, Theorem 5.2.15] has exponential complexity. Groups satisfying similar combinatorial conditions have also been studied; such groups are called small cancellation groups, and they have decidable word problem; see [LS77, Chapter 5] for further details.
In [Kam11a], Kambites showed that the probability that a randomly chosen finite monoid presentation is C(4) tends to 1 as the length of the presentation tends to infinity, and the rate of convergence appears to be rather high; see Table 1. Hence, in some sense, algorithms for small overlap monoids are widely applicable. In Kambites [Kam11b], an explicit algorithm (WpPrefix) is presented for solving the word problem in finitely presented monoids satisfying C(4). If u, v ∈ A*, then, provided that certain properties of the presentation are already known, Kambites' algorithm requires O(min{|u|, |v|}) time. In Kambites [Kam09b], among many other results, it is shown that there exists a linear time algorithm for computing normal forms in C(4) monoids, given a preprocessing step that requires polynomial time in the size of the alphabet and the maximum length of a relation word. The normal form algorithm is not stated explicitly in [Kam09b], and it appears that the constants in the polynomial time preprocessing step are rather large; see Section 4 for further details. The purpose of this paper is to provide an explicit algorithm for computing normal forms in C(4) monoids with sufficiently small coefficients to permit its practical use. If it is already known that the input presentation satisfies C(4), and a certain decomposition of the relation words is known, then the time complexity of the algorithm we present is O(|w|^2) for input word w, and the space complexity is O(|A| + N), where N is the sum of the lengths of the relation words in R. We will show in Section 3 that the C(4) condition can be verified, and the required decomposition of the relation words found, in O(N + n) time, where n is the number of relation words.
In Section 2, we present some necessary background material and establish some notation. In Section 3, we show that the greatest n ∈ N such that a presentation P satisfies C(n) can be determined in time linear in the sum of the lengths of the relation words in P, using Ukkonen's Algorithm [Ukk95]. In Section 4, we discuss the normal form algorithm of Kambites from [Kam09b]. In Section 5, we describe, prove correct, and analyse the complexity of a subroutine that is required in the practical normal form algorithm that is the main focus of this paper. Finally, in Section 6 we present our normal form algorithm, prove that it is correct, and analyse its complexity.   Table 1: The number of 2-generated 1-relation monoids with the C(4) condition where the maximum length of a relation word is n. The values for n ≥ 14 were obtained from a uniform sample of 1000 pairs of words (l, r) with |l| = n and |r| ∈ {1, . . . , n}.
The algorithm for solving the word problem in C(4) monoids given in [Kam11b], and the main algorithm from the present paper, were implemented by the authors in the C++ library libsemigroups [M + 22].

Prerequisites
In this section we provide some of the prerequisites for understanding small overlap conditions and properties of small overlap monoids.
Let A be a non-empty set, called an alphabet. A word w over A is a finite sequence w = a_0 a_1 · · · a_{m−1}, m ≥ 0, of elements of A; when m = 0 this is the empty word, denoted ε. The set of all words over A, with concatenation of words as the operation, is called the free monoid on A and is denoted by A*. A monoid presentation is a pair A | R where A is an alphabet and R ⊆ A* × A* is a set of relations on A*. A monoid M is defined by the presentation A | R if M is isomorphic to A*/R#, where R# ⊆ A* × A* is the least congruence on A* containing R. A finitely presented monoid is any monoid defined by a presentation A | R where A and R are finite, and such a presentation is called a finite monoid presentation. For the rest of the paper, P = A | R will denote a finite monoid presentation where R = {(W_0, W_1), (W_2, W_3), . . . , (W_{n−2}, W_{n−1})}.
If s, t ∈ A* are such that there exist x, y ∈ A* and (W_i, W_{i+1}) ∈ R or (W_{i+1}, W_i) ∈ R with s = xW_iy and t = xW_{i+1}y, then we write s → t. If there exists a sequence of words s = w_0, w_1, . . . , w_n = t such that w_i → w_{i+1} for all i ∈ {0, . . . , n − 1}, then we write s →* t and we refer to such a sequence as a rewrite sequence. It is routine to verify that (s, t) ∈ R# if and only if s →* t. We say that a relation word V is a complement of a relation word W if there are relation words V = r_0, r_1, . . . , r_{n−1} = W such that either (r_i, r_{i+1}) ∈ R or (r_{i+1}, r_i) ∈ R for 0 ≤ i < n − 1. We say that a complement V of W is a proper complement of W if V ≠ W. The equivalence relation defined by complementation of relation words is a subset of the congruence R#. We will write u ≡ v to indicate that the words u, v ∈ A* represent the same element of the monoid presented by P (i.e. that u/R# = v/R#).
The relation words of the presentation P are W_0, W_1, . . . , W_{n−1}. A word p ∈ A* is called a piece if it occurs as a factor of W_i and W_j where W_i ≠ W_j, or in two different places (possibly overlapping) in the same relation word in R. Note that the definition allows for the case when i ≠ j but W_i = W_j. In this case, neither W_i nor W_j is thereby a piece (although it may be a piece for other reasons), because although W_i is a factor of W_j, it is not the case that W_i ≠ W_j. By convention the empty word ε is always a piece.
Definition 2.1.1 (cf. [Kam09a]). We say that a monoid presentation satisfies the condition C(n), n ∈ N, if no relation word can be written as the product of strictly fewer than n pieces. The condition C(1) describes those presentations in which no relation word is equal to the empty word.
Having given the definition of the condition C(4), we suppose for the remainder of the paper, that our fixed presentation P = A|R satisfies the condition C(4).
The following terms are central to the algorithms for C(4) monoids in [Kam09b, Kam11b] and are used extensively throughout the current paper.
We say that s ∈ A* is a possible prefix of a word w ∈ A* if s is a prefix of some word w_0 ∈ A* such that w ≡ w_0. For a relation word u, the maximal piece prefix of u, denoted X_u, is the longest prefix of u that is also a piece; the maximal piece suffix of u, denoted Z_u, is the longest suffix of u that is also a piece. The word Y_u such that u = X_uY_uZ_u is called the middle word of u. Since u is a relation word of a presentation satisfying the condition C(4), u cannot be written as a product of three or fewer pieces, and so the middle word Y_u of u cannot be a piece. In particular, the only relation word containing Y_u as a factor is u.
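Assuming the set of pieces has already been computed, the decomposition u = X_uY_uZ_u can be read off directly from the definitions. The following is a naive Python sketch (the helper name `xyz` is ours):

```python
def xyz(u, piece_set):
    """Decompose a relation word u as u = X Y Z, where X is the maximal
    piece prefix and Z the maximal piece suffix of u. A sketch: the set
    of pieces is assumed precomputed, and u is assumed to be a relation
    word of a C(4) presentation, so that X and Z cannot overlap."""
    X = max((u[:i] for i in range(len(u) + 1) if u[:i] in piece_set), key=len)
    Z = max((u[i:] for i in range(len(u) + 1) if u[i:] in piece_set), key=len)
    return X, u[len(X):len(u) - len(Z)], Z

# For a, b, c | acba = a^2bc the set of pieces is {ε, a, b, c}:
P = {"", "a", "b", "c"}
print(xyz("acba", P))  # -> ('a', 'cb', 'a')
print(xyz("aabc", P))  # -> ('a', 'ab', 'c')
```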
Using the above notation, every relation word u in a C(4) presentation can be written as a product of the form X_uY_uZ_u. Assume that u′ is a complement of u. Then u′ = X_{u′}Y_{u′}Z_{u′}. We will write X′_u instead of X_{u′}, Y′_u instead of Y_{u′}, and Z′_u instead of Z_{u′}. We say that X′_u is a complement of X_u, that Y′_u is a complement of Y_u, and similarly for Z_u, X_uY_u and Y_uZ_u.
A prefix of w ∈ A* that admits a factorisation of the form aXY, where XYZ is a relation word, a ∈ A*, and X and Y are the maximal piece prefix and middle word of XYZ respectively, is called a relation prefix. If w ∈ A* has relation prefixes aXY and a′X′Y′ such that |aXY| = |a′X′Y′| for some a, a′ ∈ A*, then a = a′, X = X′, and Y = Y′, as a direct consequence of the C(4) condition. A relation prefix of the form p = bX_0Y′_0X_1Y′_1 · · · X_{n−1}Y′_{n−1}X_nY_n, n ≥ 0 and b ∈ A*, is called an overlap prefix if it satisfies the following: (i) for 0 ≤ i < n, Y′_i is a proper non-empty prefix of the middle word Y_i of some relation word X_iY_iZ_i; and (ii) there does not exist a factor in p of the form X_mY_m beginning before the end of b.
A relation prefix aXY of a word u is called a clean relation prefix of u if u does not have a prefix of the form aXY′X_0Y_0, where Y′ is a proper, non-empty prefix of Y. An overlap prefix of u that is also a clean relation prefix is called a clean overlap prefix of u. If p is a piece, then the word u is called p-active if pu has a relation prefix aXY for some a ∈ A* such that |a| < |p|.
Consider the presentation P = a, b, c, d | a^2bc = acba, adca = bd^2b. The relation words are W_0 = a^2bc, W_1 = acba, W_2 = adca, W_3 = bd^2b, and the set of pieces of this presentation is {ε, a, b, c, d}. Then W_0 is a proper complement of W_1, and W_2 is a proper complement of W_3. Since no relation word can be written as the product of fewer than 4 pieces, P is a C(4) presentation.
Let w = cba^2bd^2a. The word w has two relation prefixes: cbX_{W_0}Y_{W_0} = cbaab and cbaaX_{W_3}Y_{W_3} = cbaabdd. The shorter of these, cbX_{W_0}Y_{W_0}, is an overlap prefix since there is no factor of the form X_{W_i}Y_{W_i} beginning before the end of cb. Let p = a. The word w is p-active since pw has the relation prefix X_{W_1}Y_{W_1} = acb (taking a = ε), and clearly |ε| < |p|.
The following results describe some properties of presentations satisfying the C(4) condition mentioned in [Kam09a] as weak cancellativity properties.
Proposition 2.1.3 (Proposition 1 in [Kam09a]). Let w be a word in A* and let aX_0Y′_0X_1Y′_1 . . . X_nY_n be an overlap prefix of w. Then no relation word is contained in this prefix, except possibly X_nY_n in the case that Z_n = ε.
In a word w ∈ A*, an overlap prefix aX_0Y′_0X_1Y′_1 . . . X_tY_t is always contained in some clean overlap prefix aX_0Y′_0X_1Y′_1 . . . X_sY_s with s ≥ t. In addition, if a word has a relation prefix, then its shortest relation prefix is an overlap prefix. If a word w contains a relation word u as a factor, then it has a relation prefix of the form aX_uY_u for some a ∈ A*; it follows that it also has an overlap prefix, namely its shortest relation prefix. Since any overlap prefix is contained in a clean overlap prefix, it also follows that w has a clean overlap prefix. Hence, taking the contrapositive, if a word in A* does not have a clean overlap prefix, then it contains no relation word as a factor.
If aXYZ is a prefix of a word w, with aXY an overlap prefix and XYZ a relation word, then aXY is a clean overlap prefix of w. If aXY is a clean overlap prefix of a word w and X′Y′ is a complement of XY, then aXY and aX′Y′ are not necessarily clean overlap prefixes of words equivalent to w. However, it can be shown that such prefixes are always overlap prefixes of words equivalent to w.
Lemma 2.1.4 (Lemma 2 in [Kam09a]). If a word w ∈ A* has clean overlap prefix aXY and w ≡ v for some v ∈ A*, then v has either aXY or aX′Y′, for X′Y′ some complement of XY, as an overlap prefix; and no relation word in v overlaps this prefix, unless it is XYZ or X′Y′Z′.
In [Kam11b], Kambites describes an algorithm that takes as input two words and a piece of a given presentation satisfying the condition C(4), and returns Yes if the words are equivalent and the piece is a possible prefix of the words, and No otherwise. We will refer to this algorithm in the following sections of this paper, and WpPrefix(u, v, p) will be used to denote the result of this algorithm with input the words u and v and the piece p. In [Kam11b] it is shown that, for a fixed C(4) presentation, this algorithm decides whether u and v are equivalent and whether p is a possible prefix of u in O(min(|u|, |v|)) time, provided that the decomposition of the relation words into the form XYZ is known.

A linear time algorithm for the uniform word problem
In [Kam09a], Kambites explores the complexity of the so-called uniform word problem for C(4) presentations. Given a finite monoid presentation A | R and two words in A*, the uniform word problem asks whether the two words represent the same element of the monoid defined by A | R . In particular, determining whether the presentation satisfies the C(4) condition is part of the uniform word problem for C(4) presentations. In [Kam09a] it is shown that the uniform word problem for C(4) presentations can be solved, in the RAM model of computation, in O(|R|^2 min(|u|, |v|)) time, where u, v ∈ A* are the two input words and |R| is the sum of the lengths of the distinct relation words in the presentation.
We will show that the uniform word problem for C(4) presentations can be solved in O(|R| min(|u|, |v|)) time, where |R| is the sum of the lengths of the relation words, by using a generalized suffix tree to represent the relation words. In the RAM model, we may assume that the following operations take constant time: accessing a letter of a word w ∈ A* by index; concatenating words; and comparing letters of A with respect to a given total order on A.
Assume that A is an alphabet and that s = a_0a_1 · · · a_{m−1} ∈ A* is a word with |s| = m. We use the notation s[i, j) for the factor a_i · · · a_{j−1} of s that starts at position i and ends at position j − 1 (inclusive). A word of length m has m non-empty suffixes s[0, m), . . . , s[m − 1, m), and a word x is a factor of s if and only if x is a prefix of one of the suffixes of s.
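The half-open notation s[i, j) corresponds exactly to Python's slice s[i:j], and the characterisation of factors as prefixes of suffixes can be sketched as follows (the helper name is ours):

```python
def is_factor_via_suffixes(s, x):
    # x is a factor of s if and only if x is a prefix of some suffix
    # s[i, m) of s; here s[i:m] is Python's half-open slice, matching
    # the s[i, j) notation of the text
    m = len(s)
    return any(s[i:m].startswith(x) for i in range(m + 1))

s = "aabc"
print([s[i:] for i in range(len(s))])    # -> ['aabc', 'abc', 'bc', 'c']
print(is_factor_via_suffixes(s, "ab"))   # -> True
print(is_factor_via_suffixes(s, "ba"))   # -> False
```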
Definition 3.1.1. [cf. Section 5.2 in [Gus97]] A suffix tree for a word s of length m is a rooted directed tree with exactly m leaf nodes numbered 0 to m − 1. The nodes of a suffix tree are of exactly one of the following types: the root; a leaf node; or an internal node. Each internal node has at least two children. Each edge of the tree is labelled by a nonempty factor of s and no two edges leaving a node are labelled by words that begin with the same character. For any leaf i, 0 ≤ i < m the label of the path that starts at the root and ends at leaf i is s[i, m).
A generalized suffix tree is a suffix tree for a sequence of words S = {s_0, s_1, . . . , s_{n−1}}. In a generalized suffix tree the leaf nodes are numbered by ordered pairs (i, j) for 0 ≤ i ≤ n − 1 and 0 ≤ j ≤ |s_i|. The label of the path that starts at the root node and ends at leaf (i, j) is s_i[j, m) for m = |s_i|. In a generalized suffix tree, a special unique character $_i is attached at the end of each word s_i to ensure that each suffix corresponds to a unique leaf node in the tree. Thus a generalized suffix tree has exactly N + n leaf nodes, where N is the sum of the lengths of the words in S. See Fig. 1 for an example of a generalized suffix tree.
A suffix tree for a word w of length m over an alphabet A can be constructed in O(m) time for constant-size alphabets, and in O(m log |A|) time in the general case, with the use of Ukkonen's algorithm [Ukk95]. If N is the sum of the lengths of the words in a set S of n words, then a generalized suffix tree for S can also be constructed in O(N + n) time; see [Gus97, Section 6.4] for further details.
A generalized suffix tree for a set of words S of total length N has at most 2(N + n) nodes. By definition, such a suffix tree has exactly N + n leaf nodes and one root. In addition, each internal node of the tree has at least two edges leaving it. These edges belong to paths that will eventually terminate at some leaf node. Hence there can exist at most N + n − 1 internal nodes and the total number of nodes in a suffix tree is at most 2(N + n).
Generalized suffix trees can be constructed and queried in linear time to provide various information about a set of words. For example, for a sequence of n words S of total length N we can find the longest subwords that appear in more than one word in O(N + n) time, find the longest common prefix of two strings in O(N + n) time, and check whether a word of length m is a factor of some word in S in O(m) time; see Sections 2.7 to 2.9 in [Gus97]. We are interested in utilizing generalized suffix trees in the study of C(4) presentations. Since generalized suffix trees can be queried to find the longest subwords that appear in more than one word in the set of distinct relation words in R, we can use them to find maximal piece prefixes, for example. In order to do this, however, we need to build the generalized suffix tree of the distinct relation words of a presentation, since a word p is a piece if it occurs as a factor of W_i and W_j where W_i ≠ W_j, or in two different places (possibly overlapping) in the same relation word in R.
Given the set of relation words R of a presentation, we want to construct the generalized suffix tree of the set of distinct relation words in R without altering the complexity of Ukkonen's algorithm. In practice, Ukkonen's algorithm for the generalized suffix tree of the set S = {s_0$_0, s_1$_1, . . . , s_{n−1}$_{n−1}} starts by constructing the suffix tree T for s_0$_0. Then, for each i ∈ {1, . . . , n − 1}, the edges and nodes that correspond to the word s_i are added to T; see Section 6.4 in [Gus97] for more detail. We denote this step of the procedure by AddWord(T, s_i). In order to avoid adding the same word twice, we add the following step to the procedure for each i ∈ {1, . . . , n − 1}: before calling AddWord(T, s_i), we follow the path in T that starts at the root node and is labelled by s_i. If this path ends at an internal node ν and one of the children of ν is a leaf node labelled (j, 0) for some j, then s_j = s_i and hence we do not call AddWord(T, s_i). This additional step only requires traversing at most |s_i| + 1 nodes of T for each i ∈ {1, . . . , n − 1}. The total number of nodes of T is bounded above by 2(N + n), and hence this step does not alter the complexity of the construction of the generalized suffix tree.
Proposition 3.1.2. Let A | R be a finite monoid presentation such that the number of relation words in R is n for some n ∈ N, and let N be the sum of the lengths of the relation words in R. Then from the input presentation A | R the maximal piece prefixes and maximal piece suffixes of the relation words in R can be computed in O(N + n) time.
Proof. Using Ukkonen's Algorithm, for example, a generalized suffix tree for the set {W_0$_0, W_1$_1, . . . , W_{n−1}$_{n−1}} of distinct relation words in R can be constructed in O(N + n) time. The maximal piece prefix of a relation word W_r ∈ R can be found as follows. The path in the tree labelled by W_r$_r is followed from the root to the (unique) leaf node (r, 0). Suppose that v_0, v_1, . . . , v_m are the nodes in the path from the root node v_0 to the leaf node v_m = (r, 0). Then the maximal piece prefix of W_r corresponds to the path v_0, v_1, . . . , v_{m−1}; in other words, the maximal piece prefix of W_r corresponds to the internal node that is the parent of the leaf node labelled (r, 0). Hence the maximal piece prefix of each relation word W_r can be determined in O(|W_r|) time, and so every maximal piece prefix can be found in O(N + n) time.
The maximal piece suffixes of the relation words can be found as follows. A generalized suffix tree for the set of reversals of the relation words in R can be constructed in O(N + n) time, and then used, as described above, to compute the maximal piece prefixes of the reversed relation words, i.e. the maximal piece suffixes of the original relation words, in O(N + n) time also.
Alternatively, the generalized suffix tree for R can be used to directly compute the maximal piece suffix of a given relation word W_r, by finding the maximum distance from the root of any internal node ν that is the parent of a leaf node labelled (r, i), for some i, whose incoming edge is labelled only by the terminator $_r. This maximum is the length of the maximal piece suffix of W_r. In this way, the maximal piece suffix of every relation word W_r can be found in a single traversal of the nodes of the tree. Again, since there are at most 2(N + n) nodes in the tree, and the checks at each node can be performed in constant time, the maximal piece suffixes of all the relation words can be found in O(N + n) time using this approach also.
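To illustrate the proof of Proposition 3.1.2, the following Python sketch builds a generalized suffix tree naively, by compressing the trie of all suffixes (an O(N^2) construction, in contrast to the O(N + n) of Ukkonen's algorithm), and reads off the maximal piece prefix of relation word r as the string depth of the parent of the leaf (r, 0). All names are ours, private-use characters stand in for the terminators $_i, and the relation words are assumed distinct:

```python
def suffix_trie(words):
    # Uncompressed trie of all suffixes of the (already terminated) words;
    # each leaf records the pair (word index, suffix start position)
    root = {}
    for i, w in enumerate(words):
        for j in range(len(w)):
            node = root
            for ch in w[j:]:
                node = node.setdefault(ch, {})
            node[None] = (i, j)
    return root

def compress(trie):
    # Merge unary chains so that every internal node branches, as in a
    # suffix tree; returns a list of (edge label, child) pairs, where a
    # child is either such a list (internal node) or a leaf pair (i, j)
    children = []
    for ch, sub in trie.items():
        label = ch
        while len(sub) == 1 and None not in sub:
            ch2, sub = next(iter(sub.items()))
            label += ch2
        if set(sub) == {None}:
            children.append((label, sub[None]))
        else:
            children.append((label, compress(sub)))
    return children

def maximal_piece_prefixes(relation_words):
    """Maximal piece prefix of each (distinct) relation word, read off as
    the string depth of the parent of the leaf (r, 0), as in the proof of
    Proposition 3.1.2."""
    # unique terminators playing the role of $_0, $_1, ...
    terminated = [w + chr(0xE000 + i) for i, w in enumerate(relation_words)]
    tree = compress(suffix_trie(terminated))
    prefixes = {}
    def walk(children, depth):
        for label, child in children:
            if isinstance(child, tuple):
                i, j = child
                if j == 0:
                    prefixes[i] = relation_words[i][:depth]
            else:
                walk(child, depth + len(label))
    walk(tree, 0)
    return [prefixes[i] for i in range(len(relation_words))]

# The presentation a, b, c, d, e | a^2ea^3 = abcd from the text:
print(maximal_piece_prefixes(["aaeaaa", "abcd"]))  # -> ['aa', 'a']
```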
For example, the generalized suffix tree for the relation words in the presentation a, b, c, d, e | a 2 ea 3 = abcd can be seen in Fig. 1.
Using the approach described in the proof of Proposition 3.1.2, the path from the root 0 to the leaf node labelled (0, 0) (corresponding to the suffix a^2ea^3$_0) consists of the root 0, internal nodes 1 and 2, and the leaf node (0, 0). Hence the maximal piece prefix of a^2ea^3 is aa, being the label of the path from the root to the parent 2 of the leaf node (0, 0). Similarly, the maximal piece prefix of abcd is a, corresponding to the parent 1 of the leaf node (1, 0). For the maximal piece suffix of a^2ea^3, the leaf nodes labelled (0, i), for some i, with incoming edge labelled by $_0 are (0, 4), (0, 5), and (0, 6). The parents of these nodes are 2, 1, and 0 respectively, and hence the maximal piece suffix of a^2ea^3 is aa. The only leaf node labelled (1, i) with incoming edge labelled $_1 is (1, 4), and so the maximal piece suffix of abcd is ε.
Proposition 3.1.3. Let A | R be a finite monoid presentation such that the number of relation words in R is n for some n ∈ N and let N be the sum of the lengths of the relation words in R. Then from the input presentation A | R it can be determined whether or not the presentation satisfies C(4) in O(N + n) time.
Proof. In order to decide whether the presentation satisfies C(4) we start by computing the maximal piece prefix X_r and the maximal piece suffix Z_r of each relation word W_r. By Proposition 3.1.2, this step can be performed in O(N + n) time. The presentation is C(4) if and only if, for every relation word W_r, |X_r| + |Z_r| < |W_r| and the middle word Y_r is not a piece. It therefore suffices to show that it can be determined in O(N + n) time whether or not Y_r is a piece for every r.
Any word w ∈ A * is a piece if and only if w equals the longest prefix of w that is a piece. In the proof of Proposition 3.1.2, we showed how to compute the maximal piece prefix of the relation words in R using a generalized suffix tree in O(N + n) time. The longest prefix of Y r that is a piece can be determined in O(|Y r |) time by finding the last internal node ν on the path from the root of the same generalized suffix tree labelled by Y r ; the node ν is the parent of the leaf node (r, |X r |). If ν is the root node, then the longest prefix of Y r that is a piece is ε. Otherwise, the longest prefix of Y r that is a piece is the label of the path from the root node to ν. Hence determining whether or not every Y r is a piece can also be completed in total O(N + n) time.
If any of the words in R is empty, then the presentation A | R is not C(4). If all of the words are non-empty, then the number of relation words n is bounded above by the sum of the lengths of the relation words N . Hence the O(N + n) time complexity in Proposition 3.1.2 and Proposition 3.1.3 becomes O(N ).
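The decision procedure of Proposition 3.1.3 can be prototyped directly from the criterion in its proof, replacing the suffix tree by naive (quadratic) piece tests; the function name is ours:

```python
def satisfies_c4(relation_words):
    """Decide whether a presentation with the given relation words
    satisfies C(4), via the criterion from the proof of Proposition
    3.1.3: |X_r| + |Z_r| < |W_r| and Y_r is not a piece, for every
    relation word W_r. A quadratic sketch; with a generalized suffix
    tree this takes O(N) time."""
    words = list(dict.fromkeys(relation_words))  # distinct relation words

    def is_piece(f):
        # a non-empty word is a piece if it occurs at least twice,
        # possibly overlapping, across the distinct relation words
        if f == "":
            return True
        hits = 0
        for w in words:
            i = w.find(f)
            while i != -1:
                hits += 1
                if hits > 1:
                    return True
                i = w.find(f, i + 1)
        return False

    for w in words:
        if not w:
            return False  # an empty relation word violates even C(1)
        X = max((w[:i] for i in range(len(w) + 1) if is_piece(w[:i])), key=len)
        Z = max((w[i:] for i in range(len(w) + 1) if is_piece(w[i:])), key=len)
        if len(X) + len(Z) >= len(w):
            return False
        if is_piece(w[len(X):len(w) - len(Z)]):  # the middle word Y
            return False
    return True

print(satisfies_c4(["aaeaaa", "abcd"]))  # a^2ea^3 = abcd -> True
print(satisfies_c4(["abc", "cba"]))      # abc = cba is C(3) only -> False
```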
The presentation a, b, c, d, e | a 2 ea 3 = abcd can be seen to be C(4) as follows. The only internal node on the path from the root of the suffix tree depicted in Fig. 1 labelled by ea is the root itself. Hence the maximal piece prefix of ea is ε and so ea is not a piece. Similarly, the only internal node on the path from the root labelled bcd is the root itself, and so bcd is not a piece either. Hence, by the proof of Proposition 3.1.3, the presentation a, b, c, d, e | a 2 ea 3 = abcd is C(4).
Proposition 3.1.3 allows us to prove the following theorem.
Theorem 3.1.4. Let A | R be a finite monoid presentation such that the number of relation words in R is n for some n ∈ N, let N be the sum of the lengths of the relation words in R, and let u, v ∈ A * be arbitrary. Then the uniform word problem with input the presentation A | R , and the words u and v can be solved in O((N + n) min(|u|, |v|)) time.
Proof. Given Proposition 3.1.3, the proof of this theorem is essentially identical to the proof of the corresponding result in [Kam09a].

Kambites' normal form algorithm
Let A = {a_0, a_1, . . . , a_{n−1}} be a finite alphabet and define a total order < on the elements of A by a_0 < a_1 < · · · < a_{n−1}. We extend this to a total order on A*, called the lexicographic order, as follows. The empty word ε is less than every other word in A*. If u = a_iu_0 and v = a_jv_0 are words in A+, where a_i, a_j ∈ A and u_0, v_0 ∈ A*, then u < v whenever a_i < a_j, or a_i = a_j and u_0 < v_0. As mentioned above, Kambites in [Kam11b] described an algorithm for testing the equivalence of words in C(4) monoids. In [Kam09b] it was shown that, given a monoid M defined by a C(4) presentation A | R and a word w ∈ A*, there exists an algorithm that computes the minimum representative of the equivalence class of w with respect to the lexicographic order on A*. This minimum representative is also known as the normal form of w. It is not, perhaps, immediately obvious that such a minimal representative exists, because the lexicographic order is not a well order (it is not true that every non-empty subset of A* has a lexicographically least element). However, every equivalence class of a word in a C(3) monoid is finite; see, for example, [Hig92, Corollary 5.2.16]. Since any presentation that satisfies C(4) also satisfies C(3), there exists a lexicographically minimal representative of the class of any w ∈ A*. We denote the lexicographically minimal word equivalent to w ∈ A* by min w.
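Since every equivalence class in a C(3) (hence C(4)) monoid is finite, min w can in principle be computed by exhaustively rewriting in both directions and taking the least word found. The following brute-force Python sketch (ours, and far slower than the algorithms discussed in this paper) makes this precise; Python's built-in string comparison agrees with the lexicographic order just defined, provided the generators are single characters in the chosen order:

```python
from collections import deque

def normal_form(word, relations):
    """Lexicographically least word equivalent to `word`, found by
    exhaustively applying the relations in both directions. A brute-force
    reference sketch: it terminates because equivalence classes in C(3)
    monoids are finite."""
    rules = list(relations) + [(r, l) for (l, r) in relations]
    seen = {word}
    queue = deque([word])
    while queue:
        w = queue.popleft()
        for l, r in rules:
            i = w.find(l)
            while i != -1:
                v = w[:i] + r + w[i + len(l):]
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
                i = w.find(l, i + 1)
    # Python string comparison is the lexicographic order defined above
    return min(seen)

print(normal_form("cba", [("abc", "cba")]))   # -> abc
print(normal_form("cbaa", [("abc", "cba")]))  # -> abca
```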
We note that since all of the equivalence classes of a C(3) monoid are finite, any monoid defined by a presentation satisfying C(n) for some n ≥ 3 is infinite. Conversely, if A | R is a presentation of a finite monoid and this presentation satisfies C(n) for some n ∈ N, then n ∈ {1, 2}.
In [Kam09b] Kambites proved the following result.
Proposition 4.1.1 (Corollary 3 in [Kam09b]). Let A | R be a finite monoid presentation satisfying C(4) and suppose that A is equipped with a total order. Then there exists an algorithm which, given a word w ∈ A * , computes in O(|w|) time the corresponding lexicographic normal form for w.
Although Proposition 4.1.1 asserts the existence of an algorithm for computing normal forms, this algorithm is not explicitly stated in [Kam09b]. In the following paragraphs we briefly discuss the algorithm arising from [Kam09b].
We require a number of definitions; see [Ber79] for further details. A transducer T = (A, B, Q, q, Q_+, E) is a 6-tuple that consists of an input alphabet A, an output alphabet B, a finite set of states Q, an initial state q, a set of terminal states Q_+ ⊆ Q, and a finite set of transitions or edges E ⊆ Q × A* × B* × Q. A pair (u, v) ∈ A* × B* is accepted by T if there is a path of edges from q to some state in Q_+ such that the concatenation of the first components of the edge labels is u and the concatenation of the second components is v. The relation accepted by T is the set of all pairs accepted by T. A relation accepted by a transducer is called a rational relation. A rational relation that contains a single pair (u, v) for each u ∈ A* is called a rational function.
A deterministic 2-tape finite automaton is an 8-tuple A = (A, B, Q_1, Q_2, q, Q_+, δ_1, δ_2) that consists of the tape-one alphabet A, the tape-two alphabet B, two disjoint state sets Q_1 and Q_2, an initial state q ∈ Q_1 ∪ Q_2, a set of terminal states Q_+ ⊆ Q_1 ∪ Q_2, and two partial transition functions δ_1 : Q_1 × A → Q_1 ∪ Q_2 and δ_2 : Q_2 × B → Q_1 ∪ Q_2; in a state of Q_1 the automaton reads the next letter of the first tape using δ_1, and in a state of Q_2 it reads the next letter of the second tape using δ_2. A pair (u, v) is accepted if reading u on the first tape and v on the second tape in this way takes the automaton from q to a state in Q_+. Define lex(R#) = {(u, v) ∈ R# : v is a lexicographic normal form}. In [Kam09b] it is shown that R# is a rational relation and lex(R#) is a rational function. According to Lemma 5.3 in [Joh86], lex(R#) can effectively be computed from a finite transducer for R#. Kambites describes the construction of a finite transducer for lex(R#) in [Kam09b]. The steps for constructing this transducer are the following: • Starting from the C(4) presentation A | R , an abstract machine called a 2-tape deterministic prefix rewriting automaton with bounded expansion can be computed. The construction is given in the proof of Theorem 2 in [Kam09b]. The relation accepted by this automaton is R#.
• Using the construction in the proof of Theorem 1 in [Kam09b], the 2-tape deterministic prefix rewriting automaton can be used to construct a transducer T realizing R#.
Let δ be the length of the longest relation word in R and let P be the set of pieces of the presentation. For k ∈ N, the set A^{≤k} consists of all words in A* of length at most k, and A^{<k} consists of all words in A* of length strictly less than k. In addition, let $ be a new symbol not in A; the set A^{<k}$ consists of the words u$ such that u ∈ A^{<k}.
The state set of the transducer T, given in the proof of Theorem 1 in [Kam09b], is built from the sets defined above, and so the number of states of the transducer is extremely large even for relatively small presentations. For example, let ⟨a, b, c | a²bc = acba⟩ be the presentation. In this case |A| = 3, δ = 4 and P = {ε, a, b, c}. The size of the state set Q = C × C × P of the corresponding transducer is |C|² · |P| = 4|C|², and computing |C| for this presentation yields |Q| = 4518864080644.
Another approach arising from [Kam09b] for the computation of normal forms is the construction of a deterministic 2-tape automaton accepting lex(R#). This also begins by constructing the transducer T. The process arising from [Kam09b] for the construction of the automaton is: perform the two steps given above to construct the transducer T, then:
• using the construction in the proof of Proposition 1 in [Kam09b], a deterministic 2-tape automaton accepting R# can be constructed starting from the transducer T;
• the proof of Theorem 5.1 in [Joh85] describes the construction of a deterministic 2-tape automaton accepting lex(R#), starting from the deterministic 2-tape automaton that accepts R#.
The state set Q = Q_1 ∪ Q_2 of the 2-tape automaton that accepts R# in the second step is the same as the state set of the transducer T, partitioned into two disjoint sets Q_1 and Q_2. The state set Q′ = Q′_1 ∪ Q′_2 of the 2-tape automaton that accepts lex(R#) is the union of the sets Q′_1 = Q_1 × 2^{Q_1} and Q′_2 = Q_2 × 2^{Q_1}; hence the number of states of this automaton is greater than the number of states of the transducer T.
Although the approach described in [Kam09b] allows normal forms for words in a C(4) presentation to be found in linear time, it is impractical to use a transducer with such a large state set. The current article arose out of a desire to have a practical algorithm for computing normal forms in C(4) monoids.

Possible prefix algorithm
Before describing the procedure for finding normal forms, we describe an algorithm that takes as input a word w_0 and a possible prefix piece p of w_0 and returns a word equivalent to w_0 with prefix p. As mentioned above, the algorithm WpPrefix described in [Kam11b] can decide whether a piece p is a possible prefix of some word w_0 by calling WpPrefix(w_0, w_0, p).

Algorithm 1 ReplacePrefix(w_0, p)
1: if p is not a prefix of w_0 then ▷ aXY is the clean overlap prefix of w_0 and w′ is the suffix of w_0 following aXY
2: Zu ← ReplacePrefix(w′, Z), and u ← the word obtained by deleting the prefix Z from Zu
3: w_0 ← aX̄ȲZ̄u where X̄ȲZ̄ is a proper complement of XYZ such that p is a prefix of aX̄
4: end if
5: return w_0

Lemma 5.1.1. Let w ∈ A* be arbitrary. If there exists a piece p that is a possible, but not an actual, prefix of w, then the shortest relation prefix of w is a clean overlap prefix.
Proof. Since p is a possible prefix but not a prefix of w, w contains at least one relation word and hence has a relation prefix. Let aXY be the shortest relation prefix of w. Then aXY is an overlap prefix. If aXY is not clean, then w has a prefix of the form aXY′X_0Y_0 where Y′ is a proper non-empty prefix of Y. Hence the shortest clean overlap prefix of w contains aXY, and so aXY is also a prefix of every v such that v ≡ w by Lemma 2.1.4. Let w_0 be a word equivalent to w that has prefix p. Then either p is a prefix of aXY or p contains aXY. In the former case p would also be a prefix of w, which is a contradiction. In the latter case XY is a factor of p; since p is a piece, this implies that XY is also a piece, which is a contradiction since X is the maximal piece prefix of the relation word XYZ. We have shown that the shortest relation prefix of w is a clean overlap prefix, as required.
Lemma 5.1.2. Let w ∈ A * be arbitrary. If w has a piece p as a possible, but not an actual, prefix, then the shortest relation prefix of w can be found in constant time, given the suffix tree for the relation words in R.
Proof. Suppose that S = {W_0, W_1, ..., W_{n−1}} is the set of relation words and let δ be the length of the longest relation word in R. We want to find the shortest relation prefix tX_{W_i}Y_{W_i} for some t ∈ A* and W_i ∈ S. Since X_{W_i}Y_{W_i} and p are factors of relation words, |p|, |X_{W_i}Y_{W_i}| ≤ δ. Since tX_{W_i}Y_{W_i} is the shortest relation prefix of w, t is a prefix of every word equivalent to w, and hence t is a proper prefix of p. In particular, |t| < |p| ≤ δ. If |w| ≥ 2δ, then we define v to be the prefix of w of length 2δ; otherwise, we define v to be w. In order to find the shortest relation prefix of w it suffices to find the shortest relation prefix of v. For a given presentation, the length of v is bounded above by the constant value 2δ.
In practice, in order to find the shortest relation prefix of v, we construct a suffix tree for all words X Wi Y Wi such that X Wi Y Wi Z Wi is a relation word of the presentation. This is done in O(N + n) time, for N the sum of the lengths of the relation words in the presentation. A factor of v has the form X Wi Y Wi for some i if and only if this factor labels a path that starts at the root node of the tree and ends at some leaf node labelled by (i, 0). Hence the shortest relation prefix of v can be found by traversing the nodes of the tree at most |v| times. Since the length of v is at most 2δ this can be achieved in constant time.
The complexity of this procedure is O((N + n)|v|) = O(2δ(N + n)) which is independent of the choice of w.
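The windowing argument can be mirrored directly in code. The sketch below finds the shortest relation prefix by a naive scan in place of the suffix tree (so the O(N + n) preprocessing of the proof is not achieved, but the restriction to the length-2δ window v is exactly as in Lemma 5.1.2; the function name and example patterns are ours):

```python
def shortest_relation_prefix(w, xy_parts, delta):
    """Return (t, xy) such that t*xy is the shortest prefix of w ending in
    some word xy = X_W Y_W from xy_parts, or None if there is none.
    Only the window v = w[:2*delta] is examined, as in Lemma 5.1.2."""
    v = w[: 2 * delta]
    for end in range(1, len(v) + 1):           # smallest end position first
        for xy in xy_parts:
            if end >= len(xy) and v[end - len(xy):end] == xy:
                return v[: end - len(xy)], xy  # t, then the X_W Y_W part
    return None
```

For instance, with the illustrative XY-parts acb, aab and dbb, the word acbdb²d has shortest relation prefix acb (with t = ε).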
Next, we will show that Algorithm 1 is valid.
Proposition 5.1.3. If w_0, p ∈ A* are such that p is a piece and a possible prefix of w_0, then ReplacePrefix(w_0, p) returns a word that is equivalent to w_0 and has prefix p in O(|w_0|) time, given the suffix tree for the relation words in R.
Proof. We will prove that the algorithm returns the correct result using induction on the number k of recursive calls in line 2. Note that if p is a possible prefix of w 0 and w 0 contains no relation words, then p is a prefix of w 0 . On the other hand, if p is not a prefix of w 0 , then w 0 must contain a relation word, and hence a clean overlap prefix.
We first consider the base case, when k = 0. Let p be a piece and w 0 a word such that ReplacePrefix(w 0 , p) terminates without making a recursive call. This only happens in case p is already a prefix of w 0 and the algorithm returns w 0 in line 5. Hence when k = 0 the word returned by ReplacePrefix(w 0 , p) is w 0 and has prefix p.
Next, we let k > 0 and assume that the algorithm returns the correct result when termination occurs after strictly fewer than k recursive calls. Now let p be a piece and w 0 a word such that ReplacePrefix(w 0 , p) terminates after k recursive calls. It suffices to prove that the first recursive call returns the correct output.
If p is already a prefix of w_0, a recursive call does not happen; hence we are in the case where p is not a prefix of w_0. Since p is a possible prefix of w_0, there exists a word that is equivalent but not equal to w_0 and that has p as a prefix. This means that w_0 has a relation prefix and hence it has a clean overlap prefix of the form aXY. By Lemma 2.1.4, every word equivalent to w_0 has aX̄Ȳ, for X̄Ȳ a complement of XY, as a prefix. Hence, since p is not a prefix of w_0, p must be a prefix of aX̄Ȳ for X̄Ȳ a proper complement of XY. Since p is a piece, |p| ≤ |aX̄|, because otherwise a prefix of X̄Ȳ longer than X̄ would be a piece. Hence p is a prefix of aX̄. It follows that there exists a word equivalent to w_0 in which aXY is followed by Z, and we can rewrite XYZ to X̄ȲZ̄. This implies that if w′ is the suffix of w_0 following aXY, then Z is a possible prefix of w′. In particular, by the inductive hypothesis, ReplacePrefix(w′, Z) is Zu for some u ∈ A* and aX̄ȲZ̄u is a word equivalent to w_0 that has prefix p. Therefore, by induction, the algorithm will return aX̄ȲZ̄u in line 5 after making the recursive call in line 2.
It remains to show that the output of ReplacePrefix(w_0, p) can be computed in O(|w_0|) time. The first argument of each recursive call within ReplacePrefix(w_0, p) is a suffix of w_0, and it is a proper suffix in every call after the first. Hence, if WpPrefix(w_0, w_0, p) = Yes, then p is a possible prefix of w_0 and the number of recursive calls in Algorithm 1 is bounded above by the length of w_0.
Let δ be the length of the longest relation word of our presentation. In line 1, we begin by checking if p is a prefix of w_0. Clearly, this can be done in |p| steps and, since p is a piece, |p| < δ. In line 1, we also search for the clean overlap prefix of w_0. As shown in Lemmas 5.1.1 and 5.1.2, this can be done in constant time. Next, in line 2 we delete a prefix of length |Z| from the output of ReplacePrefix(w′, Z). Since |Z| < δ, the complexity of this step is also constant for a given presentation. The search for a complement X̄ȲZ̄ of XYZ such that p is a prefix of aX̄ can be performed in constant time, since the number of relation words is constant for a given presentation and |p| < δ. In line 3, we concatenate words to obtain a word equivalent to w_0. In every recursive call we concatenate three words, hence the complexity of this step is also constant. As we have already seen, the number of recursive calls of the algorithm is bounded above by the length of w_0, and hence the output of ReplacePrefix(w_0, p) can be computed in O(|w_0|) time.

For the following examples we will use the notation w_i and p_i for the parameters of the i-th recursive call of ReplacePrefix(w, p), and we let w_0 = w and p_0 = p.
Let w = acbdb²d. The algorithm WpPrefix(w, w, d) returns Yes, and we want to find ReplacePrefix(w, d).
We begin with w 0 = acbdb 2 d, p 0 = d and u 0 = ε. Clearly w 0 does not begin with d but using the process described in Lemma 5.1.2, we can find the clean overlap prefix of w 0 which is acb = X W0 Y W0 and hence w 0 satisfies the conditions of line 1. In line 2, w 1 ← db 2 d, p 1 ← a and in order to compute u 1 we need to compute ReplacePrefix(db 2 d, a). Since w 1 does not begin with a we need to find the clean overlap prefix of w 1 which is db 2 = X W2 Y W2 . Now w 2 = d, p 2 = d and ReplacePrefix(d, d) returns d. Now w 1 will be rewritten to a complement of db 2 d that begins with a, hence we choose one of W 0 and W 1 . If we choose W 0 , w 1 ← acba and w 0 ← db 2 dcba. If we choose W 1 , w 1 ← a 2 bc and w 0 ← db 2 dabc. In both cases the algorithm returns a word equivalent to w that begins with d.
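The worked example above can be reproduced in code. The sketch below is a toy model of Algorithm 1: the single relation class {acba, a²bc, db²d} and its (X, Y, Z) factorisations are our reconstruction from this example (they are not stated explicitly in the text), a naive scan stands in for the suffix tree, and the lexicographically least admissible complement is chosen where Algorithm 1 allows any.

```python
# Toy model of ReplacePrefix (Algorithm 1). One relation class whose words
# are complements of each other; the (X, Y, Z) factorisations below are
# assumptions reconstructed from the worked example (X = maximal piece
# prefix, Z = maximal piece suffix).
RELATION = {
    "acba": ("a", "cb", "a"),   # X.Y = acb, Z = a
    "aabc": ("a", "ab", "c"),
    "dbbd": ("d", "bb", "d"),   # X.Y = db^2, Z = d
}

def clean_overlap_prefix(w):
    """Shortest prefix of w of the form a.XY (naive scan in place of the
    suffix tree); returns (a, relation_word) or None."""
    best = None
    for word, (x, y, _) in RELATION.items():
        i = w.find(x + y)
        if i != -1 and (best is None or i + len(x + y) < best[0]):
            best = (i + len(x + y), w[:i], word)
    return None if best is None else (best[1], best[2])

def replace_prefix(w0, p):
    """Return a word equivalent to w0 with prefix p, assuming p is a
    possible prefix piece of w0."""
    if w0.startswith(p):
        return w0                                   # line 5: nothing to do
    a, word = clean_overlap_prefix(w0)              # line 1: find aXY
    x, y, z = RELATION[word]
    w1 = w0[len(a) + len(x) + len(y):]              # suffix after aXY
    u = replace_prefix(w1, z)[len(z):]              # line 2: recurse, drop Z
    for comp in sorted(RELATION):                   # line 3: proper complement
        xb, yb, zb = RELATION[comp]
        if comp != word and (a + xb).startswith(p):
            return a + xb + yb + zb + u
    raise ValueError("p is not a possible prefix of w0")
```

With w = acbdb²d encoded as "acbdbbd", replace_prefix("acbdbbd", "d") follows the recursion traced above and returns "dbbdabc" (= db²dabc, the W_1 choice in the example), which begins with d as required.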

A practical normal form algorithm
In this section we describe a practical algorithm for computing lexicographically normal forms in C(4) monoids. This section has four subsections: the first contains a description of the algorithm; the second a proof that the algorithm returns a word equivalent to the input word; the third contains a proof that the algorithm returns the lexicographically least word equivalent to the input word; and in the final section we consider the complexity of the algorithm.

Statement of the algorithm
In this section, we describe the main algorithm of this paper for computing the lexicographically least word equivalent to an input word. Roughly speaking, the input word is read from left to right, clean overlap prefixes of the form uXY for u ∈ A * are found and replaced with a lexicographically smaller word if possible. Subsequently, the next clean overlap prefix of this form after uXY is found, and the process is repeated. The algorithm is formally defined in Algorithm 2.
if W = X_rY_rZ_r, w = Z_rw′, w is Z̄_r-active for some proper complement Z̄_r of Z_r, w is not Z_r-active, and a is a suffix of Z̄_r with aw′ = X_sY_sw″ and WpPrefix(w″, w″, Z_s) = Yes then
4:     if there exists a proper complement of X_sY_sZ_s with prefix a that is lexicographically less than X_sY_sZ_s then
5:         X_tY_tZ_t ← the lexicographically minimal proper complement of X_sY_sZ_s that has prefix a
6:
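Algorithm 2 itself is beyond a short sketch, but its specification — return the lexicographically least word equivalent to the input — can be cross-checked on tiny presentations by brute force. The sketch below is our baseline, not Algorithm 2 (the relation class is the one we reconstructed from the earlier example): it enumerates the whole equivalence class by applying relations in both directions and takes the minimum, which is exponential in general — precisely the cost that Algorithm 2 avoids.

```python
# Brute-force baseline for "lexicographically least equivalent word":
# enumerate the equivalence class of w by rewriting with all relations in
# both directions, then take the minimum. For cross-checking tiny examples
# only; the relation class is an assumption taken from the running example.
RELATIONS = [{"acba", "aabc", "dbbd"}]  # each set: words equal in the monoid

def equivalence_class(w):
    seen, stack = {w}, [w]
    while stack:
        v = stack.pop()
        for cls in RELATIONS:
            for r in cls:
                i = v.find(r)
                while i != -1:
                    for s in cls - {r}:          # replace factor r by s
                        u = v[:i] + s + v[i + len(r):]
                        if u not in seen:
                            seen.add(u)
                            stack.append(u)
                    i = v.find(r, i + 1)
    return seen

def normal_form_bruteforce(w):
    return min(equivalence_class(w))
```

The search terminates here because all relation words in the class have equal length, so the class is finite.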

Equivalence
In this section we show that NormalForm terminates and the word returned is equivalent to the input word w 0 . We begin by observing that NormalForm rewrites v and w in lines 11-12, 15-16, 21-22, 25-26, and 29-30. For the remainder of this section, v i and w i will be v and w after the i-th time the algorithm has rewritten v and w.
The following result will be used to prove that Algorithm 2 terminates and that the word returned by the algorithm is equivalent to its input. We have already proved that if a piece p is a possible prefix of a word v, then the algorithm ReplacePrefix(v, p) returns a word equivalent to v with prefix p. If w ∈ A*, XY is a clean overlap prefix of w, w′ is the suffix of w following XY, and Z is a possible prefix of w′, then w ≡ XY Zu where Zu = ReplacePrefix(w′, Z). This is straightforward, since Z is a possible prefix of w′ and hence w′ ≡ Zu for Zu = ReplacePrefix(w′, Z).
Lemma 6.2.1. Assume that w 0 ∈ A * is the input to NormalForm. Then at each step of NormalForm(w 0 ), v i w i ≡ w 0 .
Proof. We proceed by induction on i. For i = 0, v 0 = ε and hence v 0 w 0 = w 0 . Let k ∈ N and assume that v k w k ≡ w 0 . We will prove that v k+1 w k+1 ≡ w 0 .
In the cases of lines 21-22 and 29-30, it is clear that some prefix of w_k is transferred to the end of v_{k+1}. In particular, v_kw_k = v_{k+1}w_{k+1} ≡ w_0. In lines 15-16, a prefix of w_k is again transferred to the end of v_{k+1} and Algorithm 1 is applied. Hence v_kw_k ≡ v_{k+1}w_{k+1} ≡ w_0. In lines 25-26 we rewrite the relation word XYZ to a proper complement X̄ȲZ̄. Since XYZ begins after the beginning of w_k, there exists some s equivalent to w_k which is obtained by the application of this rewrite. Hence v_kw_k ≡ v_ks with aX̄Ȳ being a prefix of s, and v_ks = v_{k+1}w_{k+1}. It follows that v_{k+1}w_{k+1} ≡ w_0. Finally, in the case of lines 11-12 the result follows immediately from the use of WpPrefix in line 9.
In [Hig92, Theorem 5.2.14] it is shown that if w_0, w ∈ A* are such that w ≡ w_0, then |w| ≤ δ|w_0|, where δ is the maximum length of a relation word in R. Since v_iw_i ≡ w_0 at every step of NormalForm(w_0) by Lemma 6.2.1, we conclude that |v_iw_i| ≤ δ|w_0| for all i. Algorithm 2 terminates when w_i = ε. Since the length of v_{i+1} is strictly greater than the length of v_i, Algorithm 2 terminates for any w_0 ∈ A* and the while loop of line 2 is repeated at most δ|w_0| times.
Combining Lemma 6.2.1 with the fact that NormalForm terminates, we obtain the following corollary.
Corollary 6.2.2. If w 0 ∈ A * is arbitrary, then the word v returned by NormalForm(w 0 ) is equivalent to w 0 .

Minimality
We require the following definition and a number of related results to establish that the word returned by NormalForm(w 0 ) is the lexicographic minimum word equivalent to w 0 .
Definition 6.3.1. Let w ∈ A*. A middle word Y is called a special middle word of w if w = pY q for some p, q ∈ A* and there exists a word p′XY Zq′ that is equivalent to w with p′X ≡ p and Zq′ ≡ q.
In other words, Y is a special middle word of w if it is a subword of w and there exists a word equivalent to w containing XYZ as a factor in the obvious place. Note that if a relation word XYZ is a factor of w, then it follows directly from the definition that Y is a special middle word of w. Since middle words are not pieces, a middle word Y_i will never occur as a factor of a middle word Y_j unless Y_i = Y_j. So, if Y_i and Y_j are special middle words of a word w and they begin at the same position in w, then Y_i = Y_j. In the following lemma, we prove that the special middle words of an arbitrary word w do not overlap with each other.

Lemma 6.3.2. Let Y_i and Y_j be special middle words of w where Y_i occurs strictly before Y_j. Then w = pY_iqY_jr for some p, q, r ∈ A*.
Proof. Assume that Y i and Y j are such that Y i and Y j overlap as factors in w. Let Y i = xy and Y j = yz be such that w = pY i zr = pxY j r = pxyzr for some x, y, z ∈ A * .
Since Y_i is a special middle word of w = pY_izr, there exists r′ ∈ A* such that zr ≡ Z_ir′. If z is a prefix of Z_i, then yz = Y_j is a factor of the relation word X_iY_iZ_i, a contradiction since Y_j is not a piece. If z is a prefix of Z_ir′ that is longer than Z_i, then the suffix yZ_i of X_iY_iZ_i, which is longer than Z_i, is a factor of Y_j, and this contradicts the definition of Z_i. It follows that z is not a prefix of Z_ir′, and so Z_ir′ ≠ zr; in particular, zr contains a relation word as a factor and hence zr has a relation prefix and a clean overlap prefix. Hence zr = aXY q′ for aXY a clean overlap prefix and some q′ ∈ A*. In addition, |a| < |z|, since a is contained in every word equivalent to aXY q′ by Lemma 2.1.4, aXY q′ ≡ Z_ir′ and z is not a prefix of Z_i. Since |a| < |z|, there exists a suffix X′ of z that is a prefix of X, and zr = aXY q′ = zX″Y q′ where X″ is such that X = X′X″. Since Y_j is a special middle word of w and w = pxY_jr = pxY_jX″Y q′, the word X″Y q′ is equivalent to Z_jt′ for some t′ ∈ A*. Clearly, Z_j is not a prefix of X″Y, because that would imply that a suffix of X_jY_jZ_j longer than Z_j is a factor of XY. In addition, X″Y is not a prefix of Z_j because Y is not a piece. It follows that Z_jt′ ≠ X″Y q′. Since Z_jt′ ≡ X″Y q′ but Z_j is not a prefix of X″Y q′, it follows that X″Y q′ has a clean overlap prefix bX*Y* with |b| < |X″Y|, because otherwise X″Y would be a factor of all words equivalent to X″Y q′ and hence a factor of Z_j. If a prefix of X″Y longer than X″ is a factor of b, then a prefix of XY longer than X is a factor of Y_jZ_j, a contradiction. It follows that either X*Y* is a factor of X″Y or Y is a factor of X*Y*, and both cases lead to a contradiction.
We conclude that Z j cannot be a prefix of any word equivalent to r. In particular, it follows that Y j is not a special middle word of w which contradicts the initial assumption and hence Y i , Y j do not overlap.
We can order the special middle words of a word w ∈ A * by their order of appearance as factors of w from left to right. In particular, for every w ∈ A * we will refer to the sequence of special middle words (Y 0 , Y 1 , . . . , Y n ) of w where i < j whenever Y i occurs to the left of Y j in w.
The next lemma collects some basic facts about the decomposition of the relation words u into X u Y u Z u that follow more or less immediately from the definition of the C(4) condition.
(ii) If W i overlaps W j in a word w ∈ A * , then either: Z i overlaps with X j or X i overlaps with Z j .
(iii) If Y u = sY u for some s, Y u ∈ A * with s = ε, then X u s is not a factor of any relation word other than u.
To prove that the word returned by NormalForm is the lexicographically least word equal to the input word, we establish the following theorem. We establish the proof of Theorem 6.3.4 in a sequence of lemmas. We start by showing that if u ≡ v, then there is a 1-1 correspondence between the special middle words of u and v. Using the properties in Lemma 6.3.3 we obtain the following lemma, which shows that Y is a special middle word of u if and only if some complement Ȳ of Y is a special middle word of v.

Lemma 6.3.5. Let u, v ∈ A*. Assume that Y is a special middle word of u such that u = pY q for some p, q ∈ A*. Then u ≡ v if and only if one of the following holds:
(i) v = p′Y q′ such that p′ ≡ p and q′ ≡ q; or
(ii) u = pY q ≡ rXY Zt ≡ rX̄ȲZ̄t ≡ p′Ȳ q′ = v and p ≡ rX, q ≡ Zt, p′ ≡ rX̄ and q′ ≡ Z̄t.
Proof. Clearly, if (i) or (ii) holds then u ≡ v. It remains to show that if u ≡ v then (i) or (ii) holds for v. Assume that u = pY q and let v ≡ u. Then there exists a rewrite sequence u = pY q →* v. We will prove that no relation applied in this rewrite sequence can overlap Y unless it is XYZ.
It is clear that since Y is not a piece, Y is not a factor of any relation word except XYZ and no relation word is a factor of Y. We start by showing that no relation word in the rewrite sequence pY q →* v overlaps with a proper suffix of Y. Since Y is a special middle word of u, u = pY q ≡ rXY Zt for some r, t ∈ A* such that p ≡ rX and q ≡ Zt. Since q ≡ Zt, it follows by Lemma 2.1.4 that there are two cases to consider: either q has a clean overlap prefix aX_1Y_1 with |a| ≥ |Z| and Z is a prefix of all words equivalent to q, or q has a clean overlap prefix aX_1Y_1 with |a| < |Z| and Z is a prefix of aX̄_1Ȳ_1 for some complement X̄_1Ȳ_1 of X_1Y_1. If the former holds, then no relation word in the rewrite sequence can overlap with a suffix of Y, because that would imply that either a suffix of XYZ longer than Z is a factor of a different relation word or that a relation word is a factor of Y Z. Both of these lead to contradictions. Assume that q = aX_1Y_1t′ for some t′ ∈ A*. By Lemma 2.1.4 every word equivalent to q has either aX_1Y_1 or aX̄_1Ȳ_1 as a prefix. If a relation word in the rewrite sequence overlaps a suffix of Y, then one of the following holds:
• it is a factor of Y a, which is a contradiction since Y a is a factor of Y Z;
• it is a factor of Y aX̄_1Ȳ_1 for some complement X̄_1Ȳ_1 of X_1Y_1, which is a contradiction because it implies that the relation word can be written as a product of 2 pieces; or
• the relation word contains X_1Y_1 as a factor, which is clearly a contradiction.
It follows that no relation word in the rewrite sequence overlaps with a suffix of Y unless it is XYZ in the obvious place. Similarly, we can prove that no relation word in the rewrite sequence can overlap with a prefix of Y. As in the previous case, since p ≡ rX there are two cases to consider: either X is a suffix of all words equivalent to p, or it follows by Lemma 6.3.3(i) and (ii) that p ≡ r′X_2Y_2Z_2b for some r′, b ∈ A* such that |b| < |X|, and there exists a word r′X̄_2Ȳ_2Z̄_2b, for X̄_2Ȳ_2Z̄_2 some complement of X_2Y_2Z_2, that has X as a suffix. It follows by Lemma 6.3.3(ii) and (iv) that any relation word in a rewrite sequence overlapping with X̄_2Ȳ_2Z̄_2, other than X̄_2Ȳ_2Z̄_2 itself, either overlaps with a prefix of X̄_2, or overlaps with a suffix of Z̄_2 and is XYZ. It follows that no relation word in the rewrite sequence overlaps with a prefix of Y unless it is XYZ in the obvious place.
In conclusion, no relation word in the rewrite sequence u * − → v overlaps with Y unless it is a complement of XY Z and hence (i) or (ii) holds for every v such that v ≡ u.
It follows by Lemma 6.3.5 that if u ≡ v and Y_i is a special middle word of u, then some complement Ȳ_i of Y_i is a special middle word of v. Note that it also follows by Lemma 6.3.5(i) and (ii) that if Y_i, Y_j are special middle words of u and Y_i occurs to the left of Y_j in u, then the corresponding special middle word Ȳ_i occurs to the left of Ȳ_j in v. In particular, the following result holds as a corollary of Lemma 6.3.5.
The next lemma collects some basic facts about special middle words that follow more or less immediately from the definition of special middle words and the definition of clean overlap prefixes. We will use these properties in various results in the remainder of this section. Lemma 6.3.7. Let w ∈ A * .
(i) If Y is a special middle word of w and w = pY q for some p, q ∈ A * , then Z is a possible prefix of q and WpPrefix(q, q, Z)=Yes.
(ii) If w = XY w for some w ∈ A * and WpPrefix(w , w , Z)=Yes, then XY is a clean overlap prefix of w.
(iii) If Y is the middle word in line 18 of Algorithm 2 and the condition of line 19 is satisfied, then Y is not a special middle word of w = aXY w .
(iv) If w = a_0X_0Y_0w′ for some w′ ∈ A* and Y_0 is the leftmost special middle word in w, then any word equivalent to w has prefix a_0.
We use some of the properties in Lemma 6.3.7 to prove the next lemma, which refines the form of a word with respect to a C(4) monoid presentation.

Lemma 6.3.8. Let u ∈ A*. Assume that (Y_0, Y_1, ..., Y_n), for some n ≥ 0, is the sequence of special middle words of u. Then

u = a_0X′_0Y_0 a_1X′_1Y_1 ··· a_nX′_nY_n t

for some t ∈ A*, where a_i ∈ A* and, for all i, either X′_i = X_i, or X′_i is a proper suffix of X_i and a_i = Z̄_{i−1} for some complement Z̄_{i−1} of Z_{i−1}.
Proof. Since Y_0 is the leftmost special middle word of u, u = pY_0q and p ≡ a_0X_0 for some a_0 ∈ A*. If a relation word XYZ were a factor of p, then Y would be a factor of p and hence a special middle word of u occurring to the left of Y_0. This is a contradiction since Y_0 is the leftmost special middle word of u, and hence p contains no relation words as factors. It follows that p = a_0X_0. Let Y_{k−1}, Y_k be special middle words of u. Then u = rY_{k−1}b_kY_kt for some r, b_k, t ∈ A*, by Lemma 6.3.2. We will prove that b_k = a_kX′_k where either X′_k = X_k, or X′_k is a proper suffix of X_k and a_k = Z̄_{k−1}. Since Y_{k−1} is a special middle word of u, b_kY_kt ≡ Z_{k−1}s for some s ∈ A*. It follows that either Z_{k−1} is a prefix of all words equivalent to b_kY_kt, or b_kY_kt has a clean overlap prefix cXY with |c| < |Z_{k−1}| and Z_{k−1} is a prefix of cX̄Ȳ for some complement X̄Ȳ of XY.
In the latter case, since b_kY_kt = cXY q ≡ cX̄Ȳ q′ for some q, q′ ∈ A*, it follows that cXY q ≡ cXY Zq″ for some q″ ∈ A* and hence Y is a special middle word of cXY q. Since cXY q is a suffix of u, it follows by the definition of special middle words that Y is a special middle word of u.
We will prove that Y = Y_k, and hence b_k = cX_k = a_kX′_k where X′_k = X_k, as required. Clearly, if |cXY| = |b_kY_k| and Y ≠ Y_k, then either Y is a suffix of Y_k or Y_k is a suffix of Y; this is a contradiction since middle words are not pieces, and hence Y = Y_k. Assume that |cXY| < |b_kY_k|. Then Y is a special middle word that occurs after Y_{k−1} and before Y_k as a factor of u, and this is a contradiction. Assume that |cXY| > |b_kY_k|. There are two cases to consider: either Y_k is a factor of XY, which is clearly a contradiction, or Y_k is a factor of cXY that begins before the end of c. In this case, Y_k must end before the start of Y in cXY, because otherwise it would contain a factor of XY longer than X. Since Y_k is a special middle word, it follows that b_kY_kt ≡ b_kY_kZ_kt′ for some t′ ∈ A*. If Z_k is a prefix of t in this case, then either Y is a factor of X_kY_kZ_k or XY contains a suffix of X_kY_kZ_k longer than Z_k, and both of these lead to a contradiction. It follows that Z_k is not a prefix of t, and hence t has a clean overlap prefix dX*Y* with |d| < |Z_k|. If d ends after the end of Y in cXY q, then Y is a factor of X_kY_kZ_k, a contradiction. It follows that d ends before the end of Y. But then XY overlaps with X*Y*, which is a contradiction because cXY is a clean overlap prefix of cXY q. It follows that Y = Y_k.
In the former case, Z_{k−1} is a prefix of all words equivalent to b_kY_kt. If b_kY_kt has a clean overlap prefix cXY with |c| < |Z_{k−1}| and b_kY_kt ≡ cX̄Ȳ q′ for some q′ ∈ A* and a proper complement X̄Ȳ of XY, then as in the previous paragraph Y = Y_k, and hence b_k = a_kX_k, as required. Assume that b_kY_kt does not have a clean overlap prefix cXY with |c| < |Z_{k−1}| such that b_kY_kt ≡ cX̄Ȳ q′ for some q′ ∈ A* and a proper complement X̄Ȳ of XY. Then, since Y_k is a special middle word of u, if X_k is a suffix of b_k, say b_k = b′X_k, and no clean overlap prefix is contained in b′, then b′X_kY_k is a clean overlap prefix of b_kY_kt and b_k = a_kX_k, as required. If this is not the case, then either X_k is a suffix of b_k but b_kY_kt has a clean overlap prefix cXY with |cXY| ≤ |b_k| such that no equivalent of b_kY_kt has prefix cX̄Ȳ for X̄Ȳ a proper complement of XY, or X_k is not a suffix of b_k. In the first case, b_k = a_kX_k and hence X′_k = X_k, as required. In the second case, if X_k is not a suffix of b_k, then, since no word equivalent to u contains a relation word between Y_{k−1} and Y_k, it follows that X_k = X′X″ where the proper prefix X′ of X_k is a suffix of some complement Z̄_{k−1} of Z_{k−1}. It follows that b_k = Z̄_{k−1}X″, as required.
At this point, we have proved results that explain the connection between the sequence of special middle words of a word w and the form of w, or of a word equivalent to w, with respect to this sequence. We would like to be able to compare words based on their sequences of special middle words, and we utilize the algorithm WpPrefix from [Kam11b] to do this. The following results highlight the connection between special middle words in equivalent words u and v and recursive calls from within WpPrefix(u, v, ε).

Lemma 6.3.9. If WpPrefix(u_j, v_j, p_j) is a recursive call from within WpPrefix(u, v, ε) for u, v ∈ A* and u ≡ v, then u_j is a suffix of a word equivalent to u and v_j is a suffix of a word equivalent to v. In addition, if u_j = XY u′ and XY is a clean overlap prefix of u_j, and v_j = X̄Ȳ v′ and X̄Ȳ is a clean overlap prefix of v_j, then Y u′ is a suffix of u and Ȳ v′ is a suffix of v.
Proof. All line numbers in this proof refer to WpPrefix in [Kam11b]. Clearly, if WpPrefix(u_j, v_j, p_j) is the initial call to WpPrefix(u, v, ε), then u_j = u, v_j = v and hence u_j and v_j are suffixes of u and v, respectively. In addition, if u_j = u = XY u′ and XY is a clean overlap prefix of u_j, and v_j = v = X̄Ȳ v′ and X̄Ȳ is a clean overlap prefix of v_j, then clearly Y u′ is a suffix of u and Ȳ v′ is a suffix of v. Assume that the result holds for the j-th recursive call WpPrefix(u_j, v_j, p_j) from within WpPrefix(u, v, ε). We will show that the result holds for the recursive call WpPrefix(u_{j+1}, v_{j+1}, p_{j+1}) that occurs immediately after WpPrefix(u_j, v_j, p_j). Since the result holds for u_j, there exists a word w ≡ u with suffix u_j. If WpPrefix(u_{j+1}, v_{j+1}, p_{j+1}) occurs in one of lines 15, 25, 28, 29, 31 or 42, then u_{j+1} is a suffix of u_j and the result holds for u_{j+1}. If WpPrefix(u_{j+1}, v_{j+1}, p_{j+1}) occurs in line 24 or 33, then u_j = XY Zu′ for some u′ ∈ A* and u_{j+1} = Ẑu′ for Ẑ a complement of Z. Since u_j = XY Zu′ ≡ X̂Ŷ Ẑu′, there exists a word equivalent to w with suffix X̂Ŷ Ẑu′ and hence u_{j+1} = Ẑu′ is a suffix of a word equivalent to u. The proof that v_{j+1} is a suffix of a word equivalent to v is analogous.
It remains to show that if u_j has clean overlap prefix XY and u_j = XY u′, and v_j has clean overlap prefix X̄Ȳ and v_j = X̄Ȳ v′, then Y u′ is a suffix of u and Ȳ v′ is a suffix of v. We will prove this for u_j; the proof for v_j is analogous. As stated above, the result holds if WpPrefix(u_j, v_j, p_j) is the initial call to WpPrefix(u, v, ε) and u has clean overlap prefix XY. If this is not the case, then the algorithm will make a number of calls in line 15 until a suffix of u of the form XY u′ with XY a clean overlap prefix is found, and hence the result holds for Y u′ in this case as well. We will assume that the result holds for WpPrefix(u_j, v_j, p_j) such that u_j has clean overlap prefix X_jY_j and u_j = X_jY_ju′_j, and we will show that it holds for the recursive call WpPrefix(u_k, v_k, p_k) that is the first recursive call after WpPrefix(u_j, v_j, p_j) such that u_k has clean overlap prefix X_kY_k and u_k = X_kY_ku′_k. Since u_j has clean overlap prefix X_jY_j and u_j ≡ v_j, the recursive call to WpPrefix(u_{j+1}, v_{j+1}, p_{j+1}) that occurs immediately after WpPrefix(u_j, v_j, p_j) occurs in one of lines 24, 25, 28, 29, 31, 33 or 42. If it occurs in line 25, 28, 29, 31 or 42, then u_{j+1} is a suffix of u_j and, since Y_ju′_j is a suffix of u, it follows that u_{j+1} is a suffix of u. Since WpPrefix(u_k, v_k, p_k) is the first recursive call after WpPrefix(u_j, v_j, p_j) such that u_k has clean overlap prefix X_kY_k and u_k = X_kY_ku′_k, it follows that any recursive call after WpPrefix(u_{j+1}, v_{j+1}, p_{j+1}) and before WpPrefix(u_k, v_k, p_k) can only occur in line 15. It follows that u_k is a suffix of u_{j+1} and hence Y_ku′_k is a suffix of u. If the recursive call to WpPrefix(u_{j+1}, v_{j+1}, p_{j+1}) occurs in line 24 or 33, then u_j = X_jY_jZ_ju″, Y_jZ_ju″ is a suffix of u and u_{j+1} = Ẑ_ju″ for some complement Ẑ_j of Z_j.
Similar to the previous case, any recursive call after WpPrefix(u_{j+1}, v_{j+1}, p_{j+1}) and before WpPrefix(u_k, v_k, p_k) can only occur in line 15. Hence u_{j+1} = aX_kY_ku′_k for aX_kY_k a clean overlap prefix of u_{j+1}. Since Ẑ_j is a prefix of u_{j+1}, we have |aX_k| ≥ |Ẑ_j|, since otherwise a prefix of X_kY_kZ_k longer than X_k would be a factor of Ẑ_j. It follows that Y_ku′_k is a suffix of u″ and hence a suffix of u.
The following result holds as a direct consequence of Lemma 6.3.9 and the definition of special middle words, and can be viewed as a tool to "identify" special middle words inside a word w in some cases.

Lemma 6.3.10. Let u, v ∈ A* be such that u ≡ v. Assume that there exists a recursive call WpPrefix(XY u′, X̄Ȳ v′, p) from within WpPrefix(u, v, ε) for some p, u′, v′ ∈ A* such that X̄Ȳ is a proper complement of XY. Then Y is a special middle word of u and Ȳ is the corresponding special middle word of v.
Proof. Since u ≡ v it follows that WpPrefix(u, v, ε) returns Yes and hence, since WpPrefix(XY u′, X′Y′v′, p) is a recursive call from within WpPrefix(u, v, ε), WpPrefix(XY u′, X′Y′v′, p) returns Yes as well. It follows that XY u′ ≡ X′Y′v′. Since X′Y′ is a proper complement of XY it follows by Lemma 3 in [Kam11b] that u′ ≡ Zu″ and v′ ≡ Z′v″ for some u″, v″ ∈ A * . In addition, it follows by Lemma 6.3.7 (ii) that XY is a clean overlap prefix of XY u′ and X′Y′ is a clean overlap prefix of X′Y′v′. By Lemma 6.3.9, XY u′ is a suffix of a word equivalent to u and hence there exists a word w such that w ≡ u and XY Zu″ is a suffix of w. It follows that Y is a special middle word of w and by Corollary 6.3.6 some complements of Y are the corresponding middle words in u and v. By Lemma 6.3.9, Y u′ is a suffix of u and Y′v′ is a suffix of v and hence Y and Y′ are the corresponding special middle words in u and v, respectively. The next two results will be useful when comparing equivalent words based on the sequences of their special middle words. Lemma 6.3.11. Let u, v ∈ A * be such that u ≡ v. Assume that Y is a special middle word of u and let Y′ be the corresponding special middle word in v. Then there exists a recursive call WpPrefix(XY u′, X′Y′v′, p) from within WpPrefix(u, v, ε) for some p, u′, v′ ∈ A * .
Proof. All line numbers in this proof refer to WpPrefix in [Kam11b]. Assume that (Y 0 , Y 1 , . . . , Y n ) is the sequence of special middle words of u. By Lemma 6.3.
for all i and X i and X i are suffixes of X i and X i , respectively, for all i. We start by showing that the result holds for Y 0 and Y 0 . Since u = a 0 X 0 Y 0 u for some u ∈ A * and Y 0 is a special middle word of u, it follows by definition that u ≡ Z 0 u for some u ∈ A * . It follows by Lemma 6.3.7(ii) applied to X 0 Y 0 u that X 0 Y 0 is a clean overlap prefix of X 0 Y 0 u . Since there are no relation words contained in a 0 and a 0 is also a prefix of v, WpPrefix(u, v, ε) starts by making recursive calls in lines 15 and 28, until the prefix a 0 has been deleted and the recursive call to WpPrefix(X 0 Y 0 u , X 0 Y 0 v , p 0 ) occurs for some piece p 0 .
We now assume that there exists a recursive call to WpPrefix(X k Y k u k , X′ k Y′ k v k , p k ) from within WpPrefix(u, v, ε) for some p k , u k , v k ∈ A * for special middle words Y k and Y′ k , respectively, and we will prove that there exists a recursive call to WpPrefix(X k+1 Y k+1 u k+1 , X′ k+1 Y′ k+1 v k+1 , p k+1 ) from within WpPrefix(u, v, ε) for some p k+1 , u k+1 , v k+1 ∈ A * for special middle words Y k+1 and Y′ k+1 , respectively.
We will show this by examining the recursive calls that occur after WpPrefix(X k Y k u k , X′ k Y′ k v k , p k ). If Z k is not a prefix of u k then the recursive call immediately after WpPrefix(X k Y k u k , X′ k Y′ k v k , p k ) occurs in one of lines 28, 29, 31 and 42. Since Z k is not a prefix of u k and Y k u k is a suffix of u by Lemma 6.3.9, it follows by Lemma 6.3.8 that a k+1 X k+1 = a′ k+1 X′ k+1 . In addition, since Y k is a special middle word, u k is equivalent to a word with prefix Z k and hence it has a clean overlap prefix cXY with |c| < |Z k |. Since Z k is not a prefix of cX, it must be a prefix of cX′ for X′ a proper complement of X.
If the recursive call occurs in one of lines 28, 29 or 31 then the first argument is u k which in this case has a k+1 X k+1 Y k+1 as a prefix. If a k+1 X k+1 Y k+1 is a clean overlap prefix of u k the algorithm continues by making |a k+1 | recursive calls in line 15. Then since u ≡ v it makes the call WpPrefix(X k+1 Y k+1 u k+1 , V, p k+1 ) and V = X′ k+1 Y′ k+1 v k+1 for some p k+1 , u k+1 , v k+1 ∈ A * . In addition, since this call was preceded by |a k+1 | calls in line 15 it follows that v k has the prefix a k+1 X′ k+1 Y′ k+1 and since there is no clean overlap prefix in a k+1 , it follows by Lemma 6.3.7(ii) that Y′ k+1 is the leftmost special middle word occurring to the right of Y′ k in v and hence it is the special middle word that corresponds to Y k+1 in u. Assume that a k+1 X k+1 Y k+1 is not a clean overlap prefix of u k . Since Y k+1 is a special middle word of u, it follows by Lemma 6.3.7 (ii) that X k+1 Y k+1 is a clean overlap prefix of a suffix of u k . It follows that the clean overlap prefix cXY of u k is such that |cXY | ≤ |a k+1 |, otherwise either cXY would overlap with X k+1 Y k+1 or it would contain X k+1 Y k+1 as a factor and both of these contradict the definition of a clean overlap prefix. It follows that cXY is such that |cXY | ≤ |a k+1 | and since Y is not a special middle word of u, cXY is not followed by Z and u k is only equivalent to words that have cXY as a prefix. It follows that in this case the algorithm makes recursive calls in lines 15 and 28 until a k+1 gets deleted. Similar to the previous case, a k+1 is a prefix of v k , the algorithm makes the recursive call WpPrefix(X k+1 Y k+1 u k+1 , X′ k+1 Y′ k+1 v k+1 , p k+1 ) for some p k+1 , u k+1 , v k+1 ∈ A * and Y′ k+1 is the special middle word of v that corresponds to Y k+1 in u.
If the recursive call occurs in line 42 then by the assumptions of this case a k+1 has prefix z 1 and any clean overlap prefix XY of u k begins after the end of z 1 . It follows that the result holds in this case following the same argument as in the case of lines 28, 29 and 31 applied to the suffix of u k that follows z 1 .
It remains to show that the result holds when Z k is a prefix of u k . In this case the recursive call immediately after WpPrefix(X k Y k Z k u k , X′ k Y′ k v k , p k ) occurs in one of lines 24, 25 or 33. If the call occurs in line 33 the result holds by symmetry with line 31. If the recursive call occurs in line 25, then u k is not Ẑ k+1 -active for any complement Ẑ k+1 of Z k+1 and hence a k+1 has the same form as in the case of lines 28, 29 and 31. Hence the same argument shows that the result holds in this case.
If X k+1 is a proper suffix of X k+1 then u k isẐ k+1 -active for some complementẐ k+1 of Z k+1 , otherwise no word equivalent to u would contain X k+1 Y k+1 as a factor in this position. It follows that in this case the algorithm makes a recursive call in line 24 andẐ k+1 u k has the clean overlap prefix bX k+1 Y k+1 for some b ∈ A * with |b| < |Ẑ k+1 |. It follows that the algorithm makes |b| recursive calls in line 15 and the result holds for Y k+1 . If u k isẐ k+1 -active but the clean overlap prefix ofẐ k+1 u k is cXY with Y = Y k+1 , then a k+1 has the same form as in the cases of lines 25, 28, 29 and 31 and the result holds in this case.
The following result holds as a corollary of Lemma 6.3.11. Corollary 6.3.12. Let u, v ∈ A * be such that u ≡ v. Assume that Y k and Y k+1 are consecutive special middle words of u. Assume that u k , u k+1 , v k , v k+1 are such that WpPrefix(u k , v k , p k ), WpPrefix(u k+1 , v k+1 , p k+1 ) are the recursive calls corresponding to Y k , Y k+1 from within WpPrefix(u, v, ε) from Lemma 6.3.11. If WpPrefix(u j , v j , p j ) is a recursive call from within WpPrefix(u, v, ε) that occurs after WpPrefix(u k , v k , p k ) and before WpPrefix(u k+1 , v k+1 , p k+1 ) and it is not the recursive call that occurs immediately after WpPrefix(u k , v k , p k ), then WpPrefix(u j , v j , p j ) occurs either in line 15 or line 28 of WpPrefix.
Proof. All line numbers in this proof refer to WpPrefix in [Kam11b]. Assume that WpPrefix(u j , v j , p j ) is the recursive call from within WpPrefix(u, v, ε) as described in the statement of this lemma. If WpPrefix(u j , v j , p j ) occurs in one of lines 31, 33 or 42 then u j = XY u where XY is a clean overlap prefix of u j and v j = XY v where XY is a clean overlap prefix of v j and u , v ∈ A * and XY is a proper complement of XY . It follows by Lemma 6.3.10 that there exists a special middle word in a word equivalent to u that occurs between Y k and Y k+1 , a contradiction. In addition, if WpPrefix(u j , v j , p j ) occurs in lines 24, 25 or 29 then either u j = XY Zu for some relation word XY Z and some u ∈ A * or u j = XY u and Z is a possible prefix of u . By Lemma 6.3.9 there exists a word equivalent to u containing a relation word XY Z as a factor and hence Y is a special middle word. It follows that there exists a special middle word in a word equivalent to u that occurs between Y k and Y k+1 , a contradiction. It follows that WpPrefix(u j , v j , p j ) can only occur in line 15 or 28.
We are now ready to prove the following lemma. Theorem 6.3.4 will hold as a corollary of Lemma 6.3.13. In addition, Lemma 6.3.13 will be used as a tool to compare prefixes of equivalent words when proving the correctness of the algorithm. Lemma 6.3.13. Suppose that u, v ∈ A * are such that u ≡ v, that u = u 0 Y 0 · · · u m Y m u m+1 , and that v = v 0 Y′ 0 · · · v m Y′ m v m+1 , where Y i are the special middle words in u and Y′ i are the special middle words in v.
Proof. All line numbers in this proof refer to WpPrefix in [Kam11b]. It follows by Lemma 6.3.7(iv) and Lemma 6.3.8, that u 0 Y 0 = a 0 X 0 Y 0 and a 0 is a prefix of every word equivalent to u.
Assume that the result holds for Y 0 = Y′ 0 , . . . , Y k−1 = Y′ k−1 for some k ≥ 1. Since u ≡ v and Y k is a special middle word of u, it follows by Lemma 6.3.11 that there exist recursive calls to WpPrefix(u k−1 , v k−1 , p k−1 ) and the next recursive call to WpPrefix within WpPrefix(u k−1 , v k−1 , p k−1 ) must occur in one of lines 24, 25, 28, or 29 (in the other cases the prefix of u k−1 is a proper complement of the prefix of v k−1 ). The only subsequent type of recursive call that can occur between WpPrefix(u k−1 , v k−1 , p k−1 ) and WpPrefix(u k , v k , p k ) is in lines 15 and 28 by Corollary 6.3.12. Hence u k = v k since in the recursive calls of lines 15 and 28 equal prefixes of the first two arguments are deleted, and the result follows. Having established Theorem 6.3.4, in the next two lemmas we consider how the special middle words of u, v ∈ A * such that u ≡ v interact with the lexicographic order.
Lemma 6.3.14. Suppose that u, v ∈ A * , that u ≡ v, and that u = pY k−1 Z k−1 X̄ k Y k q for special middle words Y k−1 and Y k in u, some p, q ∈ A * and some proper suffix X̄ k of X k . If Y k−1 and a proper complement Y′ k of Y k are the corresponding middle words in v, then there is a suffix X̂ k of a complement Z′ k−1 of Z k−1 such that X k = X̂ k X̄ k and X̂ k is also a prefix of X′ k .
Proof. Since Y k−1 , Y k are special middle words of u, there is a rewrite sequence u * for some q 0 ∈ A * . Since Z k−1 is a factor of v, but not of p X k−1 Y k−1 zX k Y k Z k q 0 by the assumption of this case, X k = X k X k and so X k = X k is the required common prefix.
In the next lemma, we make use of the following observation: if u, v ∈ A * are such that u < v and u is not a prefix of v, then uw < vw′ for all w, w′ ∈ A * . Lemma 6.3.15. Suppose that u, v ∈ A * are such that u ≡ v. If u < v and Y k is the leftmost special middle word in u such that the corresponding middle word Y′ k in v is a proper complement of Y k , then X k Y k Z k < X′ k Y′ k Z′ k . Proof. Let u := w 0 , w 1 , . . . , w n := v be any rewrite sequence. By Lemmas 6.3.8 and 6.3.13, there exist a k , b k , p, q, q′ ∈ A * such that u = pX k−1 Y k−1 a k X̄ k Y k q and v = pX k−1 Y k−1 b k X̄′ k Y′ k q′ where X̄ k and X̄′ k are (not necessarily proper) suffixes of X k and X′ k , respectively. Since Y k is a special middle word of u, it follows by Lemma 6.3.5 (ii) that there exists j ∈ {0, . . . , n} such that X k Y k Z k is a factor of w j and X′ k Y′ k Z′ k is a factor of w j+1 . It remains to show that X k Y k Z k < X′ k Y′ k Z′ k .
It follows by Lemma 6.3.9 and Lemma 6.3.11 that there exists a recursive call to WpPrefix(X k−1 Y k−1 a k X̄ k Y k q, X k−1 Y k−1 b k X̄′ k Y′ k q′, t) for some piece t, from within WpPrefix(u, v, ε). In addition, X k−1 Y k−1 is a clean overlap prefix of the first two arguments of this call by Lemma 6.3.7 (i) and (ii). Since the first two arguments of this call begin with the clean overlap prefix X k−1 Y k−1 , the recursive call occurring immediately after WpPrefix(X k−1 Y k−1 a k X̄ k Y k q, X k−1 Y k−1 b k X̄′ k Y′ k q′, t) must occur in one of lines 24, 25, 28, 29 and the only possible recursive calls that can occur before WpPrefix(X̄ k Y k q, X̄′ k Y′ k q′, t′) for some piece t′, are those in lines 15 or 28 by Corollary 6.3.12. Since in the recursive calls of lines 15 and 28 equal prefixes of the first two arguments are deleted, it follows that X̄ k Y k q < X̄′ k Y′ k q′. We will show that this implies that X k Y k < X′ k Y′ k and hence X k Y k Z k < X′ k Y′ k Z′ k . There are two cases to consider: when X̄ k = X k and when X̄ k is a proper suffix of X k . If X̄ k = X k then it follows that X̄′ k = X′ k , since otherwise there would not be a recursive call to WpPrefix(X̄ k Y k q, X̄′ k Y′ k q′, t′) from within WpPrefix(u, v, ε), which contradicts Lemma 6.3.11. It follows that X k Y k < X′ k Y′ k in this case. Suppose that X̄ k is a proper suffix of X k . It follows by Lemma 6.3.8 that a k = b k = Z k−1 . In this case, as in the previous case, it follows that X̄ k Y k < X̄′ k Y′ k . By Lemma 6.3.14, if X̂ k and X̂′ k are such that X k = X̂ k X̄ k and X′ k = X̂′ k X̄′ k , then X k Y k < X′ k Y′ k as well. We are now ready to prove the correctness of the NormalForm algorithm. We have already shown that the output of NormalForm(w 0 ) is a word equivalent to w 0 in Corollary 6.2.2. By Theorem 6.3.4 it suffices to show that if v n is the output of NormalForm(w 0 ), then the sequences of special middle words of v n and min w 0 are identical. We accomplish this in the next two lemmas.
Lemma 6.3.16. Let w 0 be the input to NormalForm(w 0 ) and let v i with i ≥ 0 be the value of v after the i-th iteration of the while loop starting in line 2. Then for every special middle word Y k in w 0 there exists an i such that v i contains a complement of Y k .
Proof. All line numbers in this proof refer to NormalForm in Algorithm 2. We will show that there exists an i such that during the i-th iteration of the while loop in NormalForm one of the following holds: We proceed by induction on the number of special middle words in w 0 . By Theorem 6.3.4, if there are no special middle words in w 0 , then the only word equivalent to w 0 is itself. In particular, if there are no special middle words in w 0 , then neither w 0 nor any word equivalent to w 0 contains a relation word as a factor, by Lemma 6.3.5. It follows that either w 0 does not have a clean overlap prefix or if w 0 has a clean overlap prefix cXY for some c ∈ A * such that w 0 = cXY w for w ∈ A * , then WpPrefix(w , w , Z)=No because otherwise there exists a word equivalent to w 0 that contains XY Z as a factor in the obvious place. This is a contradiction since there are no special middle words in w 0 . It follows that in every iteration of the while loop in line 2 the conditions of line 3 are not satisfied. If the conditions of line 18 are satisfied, then the condition of line 19 is satisfied as well and if the conditions of line 18 are not satisfied then we have the case of lines 29-30. In particular, in every iteration of the while loop in line 2 v i w i = w 0 and hence the algorithm returns w 0 , as required.
Suppose that w 0 contains at least one special middle word. Let w 0 = a 0 X 0 Y 0 q where Y 0 is the leftmost special middle word of w 0 and a 0 , q ∈ A * . Since Y 0 is a special middle word, WpPrefix(q, q, Z 0 ) = Yes by definition and hence X 0 Y 0 is a clean overlap prefix of X 0 Y 0 q by Lemma 6.3.7 (ii). Since Y 0 is the leftmost special middle word of w 0 , the algorithm begins by finding the clean overlap prefix aXY of w 0 such that w 0 = aXY w′. If aXY = a 0 X 0 Y 0 , then (ii) holds for Y 0 . If aXY ≠ a 0 X 0 Y 0 , then |aXY | < |a 0 X 0 Y 0 | because otherwise X 0 Y 0 would be a factor of aXY , which contradicts the fact that X 0 Y 0 is a clean overlap prefix of X 0 Y 0 q. In addition, X 0 Y 0 does not overlap with a suffix of aXY because aXY is clean. In this case, aXY satisfies the conditions of lines 18 and 19 and w 1 = w′. Since W is assigned to be equal to ε in line 20 and since no clean overlap prefix of w′ overlaps with X 0 Y 0 , the same steps as in the first iteration of the while loop are repeated until w i = bX 0 Y 0 w′ i for some b, w′ i such that (ii) is satisfied for Y 0 .
We assume that (i) or (ii) holds for the special middle words Y 0 , . . . , Y k−1 of w 0 for some k ≥ 1. We will show that (i) or (ii) holds for Y k also.
Let v j w j be the word equivalent to w 0 after the j-th iteration of the while loop in line 2. Suppose that either (i) or (ii) was satisfied for Y k−1 during the j-th iteration of the while loop. In this case, v j is defined in one of lines 11, 15, or 25, and in any of these cases: v j = pX k−1 Y k−1 and w j = b k X k Y k q by Lemma 6.3.8 for some p, b k , q ∈ A * , X k−1 and X k are suffixes of X k−1 and X k , respectively, and Z k−1 is a prefix of b k since w j was assigned in one of lines 12, 16 or 26. Since Y k is a special middle word of v j w j it follows by definition that WpPrefix(q, q, Z k ) =Yes. In addition, by Lemma 6.3.8, either X k = X k ; or X k is a proper suffix of X k and b k = Z k−1 . If X k is a proper suffix of X k and b k = Z k−1 , then, since Y k is a special middle word of v j w j , X k Y k q is either Z k−1 -active or Z-active for some proper complement Z of Z k−1 since there exists a word equivalent to v j w j containing X k Y k Z k as a factor in the obvious place. If the latter holds, then the conditions of line 3 are satisfied. In particular, we have that W = X k−1 Y k−1 Z k−1 and that Z k−1 is a prefix of w j since v j was assigned in one of lines 11, 15 or 25. In addition WpPrefix(q, q, Z k ) = Yes since Y k is a special middle word. Hence (i) holds for Y k . If the former holds, then the conditions of line 3 are not satisfied and since X k Y k q is Z k−1 -active it follows by definition that zX k Y k is a clean overlap prefix of w j = b k X k Y k q for some z ∈ A * with |z| < |b k |. Hence (ii) holds for Y k .
In the case that X̄ k = X k , then w j = Z k−1 c k X k Y k w′ for some c k ∈ A * . If the conditions of line 3 are satisfied for w j then c k X k Y k w′ is Z-active for some proper complement Z of Z k−1 , c k X k Y k w′ = a′X s Y s w″ for some a′, w″ ∈ A * and WpPrefix(w″, w″, Z s ) = Yes and hence there exist p, q ∈ A * such that v j w j = pY s q and a word p′X s Y s Z s q′ with p′X s ≡ p, Z s q′ ≡ q. But this implies that Y s is a special middle word of w 0 that occurs between Y k−1 and Y k , a contradiction. It follows that in this case the conditions of line 3 cannot be satisfied. Since w j = Z k−1 c k X k Y k w′ and Y k is a special middle word of v j w j , WpPrefix(w′, w′, Z k ) = Yes and hence X k Y k is a clean overlap prefix of X k Y k w′ by Lemma 6.3.7 (ii). It follows by the same argument applied to prove the base case that either c k X k Y k is a clean overlap prefix of w j or w j has a clean overlap prefix cXY such that |cXY | ≤ |c k | and hence (ii) holds for Y k . Lemma 6.3.17. Let w 0 be the input to NormalForm(w 0 ) and, for i ≥ 0, let W i , v i and w i be the values of W , v and w after the i-th iteration of the while loop in line 2. If W i = XY Z for some relation word XY Z, then Y is a special middle word of v i w i . Proof. All line numbers in this proof refer to NormalForm in Algorithm 2. The value of W i gets assigned in one of lines 10, 14, 20 and 24. In line 20, W ← ε and hence it suffices to examine the cases of lines 10, 14 and 24. In each of these cases, W i ← XY Z for some relation word XY Z and v i ← v′Y for some v′ ∈ A * . It follows that Y is a factor of v i w i .
We prove that Y is a special middle word of v i w i by induction. Assume that the k-th iteration of the while loop of line 2 is the first iteration of NormalForm(w 0 ) such that a value not equal to ε gets assigned to W k . It follows that W k−1 did not satisfy the condition of line 3 and the values of W k , v k , w k get assigned in lines 24, 25 and 26, respectively. Hence w k−1 has a clean overlap prefix aXY for some a ∈ A * , w k−1 = aXY w′ and Z is a possible prefix of w′. It follows that Y′ is a factor of v k−1 w k−1 ≡ v k−1 aXY Zq ≡ v k−1 aX′Y′Z′q = v k w k for X′Y′Z′ the lexicographically minimal equivalent of XY Z and for some q ∈ A * . Hence, by definition, Y′ is a special middle word of v k w k .
We now assume that the result holds for the first j iterations of the while loop of line 2 and assume that m ∈ N is such that the (j + m)-th iteration of the while loop of line 2 is the first iteration after the j-th iteration of NormalForm(w 0 ) such that a value not equal to ε gets assigned to W j+m . Then the value of W j+m gets assigned in one of lines 10, 14 or 24. In the cases of lines 10 and 14, W j+m−1 = ε and hence W j+m−1 = W j−1 and W j+m = W j . It follows that W j−1 = X r Y r Z r and v j−1 w j−1 = pY r q for some p, q ∈ A * and v j−1 w j−1 ≡ p X r Y r Z r q for some p , q ∈ A * . Since the value of W j gets assigned in one of lines 10 and 14, w j−1 satisfies the conditions of line 3. In particular, w j−1 = Z r w , w is Z r -active, for Z r a proper complement of Z r and Z r w = aX s Y s w and Z s is a possible prefix of w . It follows that v j−1 w j−1 ≡ p X r Y r Z r q = p X r Y r aX s Y s w . Since Z s is a possible prefix of w , it follows that X s Y s Z s is a factor of a word equivalent to v j−1 w j−1 and hence Y s is a special middle word of v j−1 w j−1 . If the values of W j , v j , w j get assigned in lines 14, 15 and 16, respectively, then v j−1 w j−1 = v j w j , W j = X s Y s Z s and hence Y s is a special middle word of v j w j . If the values of W j , v j , w j get assigned in lines 10, 11 and 12, respectively, then W ← X t Y t Z t is a complement of X s Y s Z s and since Y s is a special middle word of v j−1 w j−1 , it follows by Lemma 6.3.5 that Y t is a special middle word of v j w j .
In the case of line 24 the result follows by an argument that is identical to the argument in the proof of the base case of this proof. Lemma 6.3.18. Let w 0 be the input to NormalForm(w 0 ) and let v i with i ≥ 0 be the value of v after the i-th iteration of the while loop in line 2. Then v i is a prefix of the lexicographically minimal word min w 0 equivalent to w 0 for every i.
Proof. All line numbers in this proof refer to NormalForm in Algorithm 2. Certainly, v 0 = ε is a prefix of the lexicographically least word min w 0 equivalent to w 0 . Assume for j ≥ 1 that v j−1 is a prefix of min w 0 . We will show that v j is also a prefix of min w 0 . Since v j−1 is a prefix of min w 0 , the special middle words in v j−1 are the initial k special middle words in min w 0 for some k. The value of v j is assigned in one of lines 11, 15, 21, 25, and 29 and in every case v j−1 is a prefix of v j . We consider each of these cases separately.
line 11: If v j is defined in line 11, then the conditions of lines 3, 4 and 9 are satisfied. Since the conditions of line 3 are satisfied, W j−1 = X r Y r Z r and by Lemma 6.3.17, Y r is a special middle word of v j−1 w j−1 . Since the value of W j−1 is assigned such that Y r is a suffix of v j−1 , it follows that Y r = Y k−1 . In addition, since the value of W j gets assigned in line 10, it follows by Lemma 6.3.17 that Y s = Y k in line 3 and Y t = Y k in line 11. It suffices by Lemma 6.3.13 to prove that the special middle word Y k in v j is equal to the complement of Y k in min w 0 .
Since min w 0 ≡ v j w j , it follows that min w 0 ≤ v j w j . If the (k + 1)-th special middle word Y′ k in min w 0 is not Y k , then min w 0 < v j w j , and so, by Lemma 6.3.15, X′ k Y′ k Z′ k < X k Y k Z k . If a is the suffix of Z k−1 given in line 3, then X k Y k Z k is chosen in line 5 to be the least complement of X s Y s Z s with prefix a. Seeking a contradiction we will show that X′ k Y′ k Z′ k also has a prefix a. In order to accomplish this, we show that min w 0 and v j w j satisfy the assumption of Lemma 6.3.14. In other words, we will show that v j w j = pY k−1 Z k−1 X̄ k Y k q for some p, q ∈ A * and some suffix X̄ k of X k .
The word v j−1 was defined to be v′Y k−1 for some v′. Since the condition in line 3 holds, w j−1 = Z k−1 w″ for some w″, and Z k−1 w″ = bX k Y k w‴ for some b ∈ A * such that |b| < |Z k−1 |. This implies that X k = aX̄ k where a is the suffix of Z k−1 given in line 3, and X̄ k is a prefix of w″. Hence, since w″ is Z k−1 -active, v j w j = pY k−1 Z k−1 X̄ k Y k q for some q ∈ A * . Hence, by Lemma 6.3.14, the word a is a prefix of both X k and X′ k , giving the required contradiction.
line 15: Similar to the case of line 11, and we must show that Y k is in min w 0 . If the conditions of line 3 are satisfied but the conditions of line 4 are not satisfied, then X k Y k Z k is the lexicographically minimum relation word with prefix a. If min w 0 does not contain Y k , then it contains a proper complement Y′ k . As in the previous case, v j w j = v′Y k−1 Z k−1 X̄ k Y k q, and so by Lemma 6.3.14 (applied to v j w j and min w 0 ) the word a is a prefix of both X k and X′ k . Thus it follows by Lemma 6.3.15 that X′ k Y′ k Z′ k < X k Y k Z k ; since X′ k Y′ k Z′ k has prefix a, the condition of line 4 is satisfied, a contradiction. If the conditions of lines 3 and 4 are satisfied but the condition of line 9 is not satisfied, then WpPrefix(w 0 , v j−1 Z k−1 X k Y k Z k t, ε) = No. Hence no word equivalent to w 0 contains Y k−1 and a proper complement Y′ k of Y k where X′ k Y′ k Z′ k has prefix a. But, by Lemma 6.3.14, every word equivalent to w 0 that contains Y k−1 and a proper complement of Y k , has the property that the proper complement of Y k is the middle word of a relation word with prefix a. Hence no word equivalent to w 0 contains both Y k−1 and a proper complement of Y k . In particular, min w 0 contains Y k , as required.
line 21: In this case, v j = v j−1 aXY , w j−1 = aXY w and WpPrefix(w , w , Z)=No. It follows that Y is not a special middle word of v j−1 w j−1 . By assumption v j−1 is a prefix of min w 0 and by Lemma 2.1.4 aXY is a prefix of all words equivalent to w j−1 . It follows that v j = v j−1 aXY is a prefix of min w 0 .
line 25: In this case, v j = v j−1 aXY , w j−1 = aXY w and WpPrefix(w , w , Z) = Yes. It follows that there exists a word equivalent to v j−1 w j−1 containing XY Z as a factor and hence Y is a special middle word of v j−1 w j−1 . Since v j−1 contains the initial k special middle words of min w 0 and Y is a special middle word occurring after Y k−1 , it follows that Y = Y k . In this case, v j = v j−1 aX k Y k and by Lemma 6.3.13 it suffices to show that Y k is a factor of min w 0 . In this case, w j−1 = b k X k Y k w and Z k is a possible prefix of w . By Lemma 6.3.15, min w 0 must contain the middle word in min X k Y k Z k . Hence min w 0 contains Y k , as required.
line 29: In this case, neither of the conditions in lines 3 or 18 are satisfied. Since the condition of line 18 is not satisfied, w j−1 contains no relation words as factors and it is only equivalent to itself. Hence v j = v j−1 w j−1 is the normal form of w 0 .
The proof of the correctness of NormalForm is concluded in the following proposition.
Proposition 6.3.19. If w 0 ∈ A * is arbitrary, then the word v returned by NormalForm(w 0 ) is the lexicographically least word equivalent to w 0 .
Proof. All line numbers in this proof refer to NormalForm in Algorithm 2. In Corollary 6.2.2 it is shown that the word returned by NormalForm is equivalent to w 0 . We will use the same notation as in the proof of Corollary 6.2.2; v i , w i will be used to denote the values of v and w after the i-th iteration of the while loop in line 2.
In Lemma 6.3.16, we showed that for every special middle word Y k in w 0 there exists an i such that v i contains a complement of Y k . Since v i is a proper prefix of v i+1 for every i, it follows that eventually v i contains a complement of every special middle word in w 0 . In Lemma 6.3.18, we showed that v i is a prefix of min w 0 for all i. Together these two statements imply that when NormalForm terminates, the middle words in v i coincide with the special middle words in min w 0 and hence v i is the lexicographically least word equivalent to w 0 by Theorem 6.3.4.

Complexity
In this section we analyze the complexity of NormalForm. Throughout this section we suppose that the maximal piece prefix X, the maximal piece suffix Z, and the middle word Y have been computed already for every relation word in the given presentation A | R . The time complexity for doing this is discussed in Section 3. As such we do not include the complexity of determining that the presentation satisfies C(4), nor that of finding the X, Y , and Z, in the statements in this section. We start with two results regarding the complexity of finding a clean overlap prefix for a word w and of deciding if a word w is p-active for a piece p. Finally, we show that for a given C(4) presentation A | R , the complexity of NormalForm(w) is O(|w| 2 ) where w ∈ A * is the input.
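For concreteness, this precomputation can be sketched naively as follows: a piece is a word that occurs as a factor of the relation words in at least two distinct ways, X is the maximal piece prefix of a relation word, Z its maximal piece suffix, and Y the remaining middle word. The helper names `pieces` and `xyz` and the brute-force factor enumeration are illustrative only, and not the efficient procedure of Section 3; the data is the presentation of Example 6.4.4.

```python
def pieces(relation_words):
    # A piece is a word occurring as a factor of the relation words in at
    # least two distinct ways, i.e. with at least two distinct occurrences
    # (relation word index, start position).
    occurrences = {}
    for i, w in enumerate(relation_words):
        for start in range(len(w)):
            for end in range(start + 1, len(w) + 1):
                occurrences.setdefault(w[start:end], set()).add((i, start))
    # The empty word is a piece by convention.
    return {""} | {f for f, occ in occurrences.items() if len(occ) >= 2}

def xyz(relation_word, ps):
    # X = maximal piece prefix, Z = maximal piece suffix, Y = the rest.
    x = max((p for p in ps if relation_word.startswith(p)), key=len)
    z = max((p for p in ps if relation_word.endswith(p)), key=len)
    return x, relation_word[len(x):len(relation_word) - len(z)], z

P = pieces(["abbba", "cdc"])  # the presentation <a,b,c,d | ab^3a = cdc>
```

On this presentation `P` is {ε, a, b, b², c} and the decompositions agree with Example 6.4.4: ab³a splits as (a, b³, a) and cdc as (c, d, c).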
Lemma 6.4.1. If w ∈ A * is arbitrary, then the clean overlap prefix of w, if any, can be found in time linear in the length of w.
Proof. Let M denote the number of relation words and let δ be the length of the longest relation word in our C(4) presentation. According to Lemma 7 in [Kam09a], to check if a word v has a clean overlap prefix of the form X i Y i where X i Y i Z i = W i , 1 ≤ i ≤ M , it suffices to check if v′ has a clean overlap prefix of this form, where v′ is the prefix of v such that |v′| = 2δ.
Hence, in order to find the clean overlap prefix of w, which has the form sX i Y i for some s ∈ A * , it suffices to check at most |w| suffixes of w for clean overlap prefixes of the form X i Y i . Since each such check only inspects a prefix of length at most 2δ, this can be done in O(|w|) time for a fixed presentation.
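The scan in the proof above can be sketched as follows. For simplicity the hypothetical helper `earliest_xy` only locates the leftmost occurrence of some X i Y i as a factor of w; the additional checks that make this occurrence a genuine clean overlap prefix, and the restriction of each search to a window of length 2δ, are omitted.

```python
def earliest_xy(w, xys):
    # Return (position, xy) for the leftmost occurrence in w of any word
    # xy = X_iY_i, or None if no X_iY_i occurs in w.  Each search could be
    # restricted to a window of length 2*delta as in Lemma 7 of [Kam09a];
    # a plain scan is shown for clarity.
    best = None
    for xy in xys:
        pos = w.find(xy)
        if pos != -1 and (best is None or pos < best[0]):
            best = (pos, xy)
    return best
```

For the presentation of Example 6.4.4 the words X i Y i are ab³ and cd, so for w 0 = cdcdcab³ab³ab²cd this scan finds cd at position 0.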
Lemma 6.4.2. If w ∈ A * is arbitrary and p is a piece, then deciding if w is p-active takes constant time.
Proof. Again, let M denote the number of relation words and let δ be the length of the longest relation word in our C(4) presentation. According to Lemma 7 in [Kam09a], it suffices to check if w′ is p-active, where w′ is the prefix of w of length 2δ. Since p is a piece, clearly |p| < δ. A string searching algorithm, such as, for example, Boyer-Moore-Horspool [Hor80], can check if there exists some i, 1 ≤ i ≤ M , such that the factor X i Y i occurs in pw′ starting before the end of p. Since |pw′| < 3δ, this takes O(M δ|pw′|) = O(M δ 2 ) time, which is constant for a fixed presentation.
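Under the same assumptions, the check can be sketched as follows; the helper name `is_p_active` is ours, and Python's built-in `str.find` stands in for Boyer-Moore-Horspool.

```python
def is_p_active(w, p, xys, delta):
    # w is p-active if some X_iY_i occurs in the product p*w starting
    # strictly before the end of p; by Lemma 7 of [Kam09a] only the prefix
    # of w of length 2*delta needs to be inspected.
    s = p + w[:2 * delta]
    # str.find returns the leftmost occurrence, so it starts before the
    # end of p exactly when some occurrence does.
    return any(0 <= s.find(xy) < len(p) for xy in xys)
```

For the presentation of Example 6.4.4 (δ = 5, X i Y i ∈ {ab³, cd}) this confirms, for instance, that dcab³ab³ab²cd is c-active while ab³ab³ab²cd is not a-active, as used in the example below.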
Proposition 6.4.3. The complexity of NormalForm is O(|w 0 | 2 ) where w 0 ∈ A * is the input, given that the decompositions of the relation words in the presentation into XY Z are known.
Proof. Let A | R be the presentation, let M be the number of distinct relation words in R and let δ be the length of the longest relation word in R. We have already shown that the while loop of line 2 will be repeated at most |w 0 | times. We analyze the complexity of each step of the procedure in the loop.
In line 3 the algorithm tests if the word w is Z r -active for Z r some complement of Z r in constant time. In addition, checking if Z r = Z r requires comparing at most δ characters. Finding the suffix a of Z r such that aw = X s Y s w also requires checking at most δ characters, hence these checks can be performed in constant time.
In lines 3, 9 and 19, WpPrefix(u, v, p) is called. According to [Kam09a], the algorithm can be implemented with execution time bounded above by a linear function of the length of the shorter of the words u and v. Since every time WpPrefix(u, v, p) is called either u = w 0 or u is a suffix of some word equivalent to w 0 , this step can be executed in O(δ|w 0 |) time.
In lines 4-5 we search for proper complements of X s Y s Z s that have the prefix a. Since a is a piece, this step also requires constant time.
In line 9 the algorithm finds the suffix b of X s such that X s = ab and the suffix t of ReplacePrefix(w , Z s ) that follows Z s . This is also done in constant time since a and Z s are pieces.
In lines 9, 16 and 26 Algorithm 1 is called. Each time, Algorithm 1 takes as input a suffix of some word s equivalent to w 0 . Since |s| < δ|w 0 |, this step can be completed in O(δ|w 0 |) time.
In lines 5 and 25 we search for the lexicographically minimal complement of some relation word X i Y i Z i . Clearly, this check can be done by comparing at most δ characters M times, hence it is constant for a given presentation.
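This search can be sketched as follows, with an assumed representation of the presentation as a list of (left, right) relation pairs; the helper name `min_complement_with_prefix` is hypothetical.

```python
def min_complement_with_prefix(word, relations, a):
    # Collect the complements of `word` (the words related to it by a
    # defining relation) and return the lexicographically least one having
    # prefix `a`, or None if there is none.
    complements = set()
    for lhs, rhs in relations:
        if lhs == word:
            complements.add(rhs)
        if rhs == word:
            complements.add(lhs)
    candidates = [c for c in complements if c.startswith(a)]
    return min(candidates) if candidates else None
```

Since each candidate comparison touches at most δ characters and there are at most M candidates, the cost is constant for a fixed presentation, as claimed above.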
In line 18 the algorithm finds the clean overlap prefix of w. As shown in Lemma 6.4.1, this can be done in O(|w|) time. Since w is always a suffix of some word equivalent to w 0 , this step can be executed in O(δ|w 0 |) time.
In lines 11-12, 15, 21-22, 25-26 and 29 Algorithm 2 concatenates v with a word of length at most δ|w 0 | and deletes a prefix of length at most δ|w 0 | from w. Hence these steps require at most 2δ|w 0 | time.
We end the paper with an example of the application of NormalForm to a specific $C(4)$ presentation.
Example 6.4.4. Let $\langle a, b, c, d \mid ab^3a = cdc\rangle$ be the presentation and let $w_0 = cdcdcab^3ab^3ab^2cd$. The set of relation words of the presentation is $\{ab^3a, cdc\}$ and each relation word has a single proper complement. The set of pieces of the presentation is $P = \{\varepsilon, a, b, c, b^2\}$. Let $W_0 = ab^3a$ and $W_1 = cdc$. Clearly, $X_{W_0} = a$, $Y_{W_0} = b^3$, $Z_{W_0} = a$ and $X_{W_1} = c$, $Y_{W_1} = d$, $Z_{W_1} = c$. Algorithm 2 begins with $v \leftarrow \varepsilon$, $W \leftarrow \varepsilon$ and $w \leftarrow cdcdcab^3ab^3ab^2cd$. Since $W = \varepsilon$, the conditions of line 3 are not satisfied. The word $w$ has a clean overlap prefix $X_{W_1}Y_{W_1} = cd$ followed by $Z_{W_1} = c$, hence WpPrefix$(cdcab^3ab^3ab^2cd, cdcab^3ab^3ab^2cd, c)$ returns Yes and ReplacePrefix$(cdcab^3ab^3ab^2cd, c)$ returns $cdcab^3ab^3ab^2cd$. Since $W_0 < W_1$, in lines 24-26, $v \leftarrow X_{W_0}Y_{W_0} = ab^3$, $w \leftarrow adcab^3ab^3ab^2cd$ and $W \leftarrow ab^3a$. Now $W = ab^3a$, $w$ begins with $Z_{W_0} = a$, $w' = dcab^3ab^3ab^2cd$ is $Z_{W_1}$-active and the prefix $Y_{W_1} = d$ of $w'$ is followed by $Z_{W_1} = c$, hence the conditions in line 3 are satisfied. In addition, $ab^3a < cdc$ but $X_{W_0}$ and $X_{W_1}$ do not have a common prefix. Hence, in lines 14-16, $v \leftarrow vad$, $w \leftarrow{}$ReplacePrefix$(cab^3ab^3ab^2cd, c) = cab^3ab^3ab^2cd$ and $W \leftarrow cdc$.
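The pieces and the factorizations $X_iY_iZ_i$ above can be verified mechanically. The following Python sketch (not part of the paper's algorithms; the helper names `pieces` and `xyz` are ours) computes the set of pieces of the presentation, where a piece is a word occurring as a factor of the relation words in at least two distinct places, and then splits each relation word into its maximal piece prefix, middle part, and maximal piece suffix; $b^3$ is written as the string "bbb".

```python
# A sketch (helper names ours, not the paper's) of computing the set of
# pieces and the X_i Y_i Z_i factorisation for <a,b,c,d | ab^3a = cdc>.

relation_words = ["abbba", "cdc"]  # ab^3a and cdc as plain strings

def occurrences(p, words):
    """Total number of (possibly overlapping) occurrences of p as a factor."""
    return sum(
        sum(1 for i in range(len(w) - len(p) + 1) if w[i:i + len(p)] == p)
        for w in words
    )

def pieces(words):
    """A word is a piece if it occurs as a factor in at least two places."""
    factors = {w[i:j] for w in words
               for i in range(len(w)) for j in range(i + 1, len(w) + 1)}
    return {""} | {p for p in factors if occurrences(p, words) >= 2}

def xyz(word, ps):
    """Split word as XYZ: X the longest piece prefix, Z the longest piece suffix."""
    x = max((word[:i] for i in range(len(word) + 1) if word[:i] in ps), key=len)
    z = max((word[i:] for i in range(len(word), -1, -1) if word[i:] in ps), key=len)
    return x, word[len(x):len(word) - len(z)], z

P = pieces(relation_words)
print(sorted(P, key=lambda p: (len(p), p)))  # ['', 'a', 'b', 'c', 'bb']
print(xyz("abbba", P))                       # ('a', 'bbb', 'a')
print(xyz("cdc", P))                         # ('c', 'd', 'c')
```

Note that overlapping occurrences must be counted as distinct places: $b^2$ occurs at two overlapping positions inside $b^3$, which is why it is a piece.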
At this point, $W = cdc$ and $w = cab^3ab^3ab^2cd$ begins with $Z_{W_1} = c$, but $ab^3ab^3ab^2cd$ is not $Z_{W_0}$-active. The word $w$ has the clean overlap prefix $cab^3$, which is followed by $a$, hence in lines 24-25, $v \leftarrow vcab^3$, $w \leftarrow{}$ReplacePrefix$(ab^3ab^2cd, a) = ab^3ab^2cd$ and $W \leftarrow ab^3a$.
Next, $W = ab^3a$ and $w = ab^3ab^2cd$ begins with $Z_{W_0}$, and $b^3ab^2cd$ is not $Z_{W_1}$-active, but $w$ has the clean overlap prefix $ab^3$. The clean overlap prefix is followed by $a$, hence in lines 24-25, $v \leftarrow vab^3$, $w \leftarrow ab^2cd$ and $W \leftarrow ab^3a$.
Finally, $W = ab^3a$ and $w = ab^2cd$ begins with $Z_{W_0}$, but $b^2cd$ is not $Z_{W_1}$-active. Now $w$ has the clean overlap prefix $ab^2cd$, which is followed by $\varepsilon$, hence in line 19 WpPrefix$(\varepsilon, \varepsilon, c)$ returns No, and in lines 20-22, $v \leftarrow vab^2cd$, $W \leftarrow \varepsilon$ and $w \leftarrow \varepsilon$. Since $w = \varepsilon$, the algorithm returns $v = ab^3adcab^3ab^3ab^2cd$.
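As a sanity check, independent of Algorithm 2, one can confirm by brute force that the returned word represents the same element of the monoid as $w_0$; in fact, a single application of $cdc \to ab^3a$ to the prefix of $w_0$ already produces the output word. The following sketch (the helper names are ours) performs a breadth-first search over single applications of the relation in either direction:

```python
# Brute-force check (not the paper's algorithm) that the output of NormalForm
# equals w0 in the monoid <a,b,c,d | ab^3a = cdc>.
from collections import deque

REL = ("abbba", "cdc")  # the relation ab^3a = cdc

def rewrites(w):
    """All words obtained from w by one application of the relation."""
    for lhs, rhs in (REL, REL[::-1]):
        i = w.find(lhs)
        while i != -1:
            yield w[:i] + rhs + w[i + len(lhs):]
            i = w.find(lhs, i + 1)

def equivalent(u, v, limit=10_000):
    """BFS from u; True if v is reachable within `limit` visited words."""
    seen, queue = {u}, deque([u])
    while queue and len(seen) < limit:
        w = queue.popleft()
        if w == v:
            return True
        for x in rewrites(w):
            if x not in seen:
                seen.add(x)
                queue.append(x)
    return False

w0  = "cdcdcabbbabbbabbcd"    # cdcdcab^3ab^3ab^2cd
out = "abbbadcabbbabbbabbcd"  # ab^3adcab^3ab^3ab^2cd
print(equivalent(w0, out))    # True
```

Such a search is exponential in general and is no substitute for Algorithm 2; it merely confirms the example.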
Next, we apply NormalForm to find the normal form of $w_0 = cdab^3cdc$. We begin with $v \leftarrow \varepsilon$, $W \leftarrow \varepsilon$ and $w \leftarrow cdab^3cdc$. Since $W = \varepsilon$, the conditions of line 3 are not satisfied. The word $w$ has the clean overlap prefix $cd$ of the form $X_{W_1}Y_{W_1}$, WpPrefix$(ab^3cdc, ab^3cdc, c)$ returns Yes, and ReplacePrefix$(ab^3cdc, c)$ returns $cdcb^3a$. In lines 24-26, $v \leftarrow ab^3$, $w \leftarrow adcb^3a$ and $W \leftarrow ab^3a$.
For this iteration, $W = ab^3a$ and $w$ begins with $Z_{W_0} = a$, $dcb^3a$ is $Z_{W_1}$-active and clearly WpPrefix$(cb^3a, cb^3a, c)$ returns Yes. In addition, $ab^3a < cdc$ but $X_{W_0}$ and $X_{W_1}$ do not have a common prefix, hence in lines 14-16, $v \leftarrow vad$, $w \leftarrow{}$ReplacePrefix$(cb^3a, c) = cb^3a$ and $W \leftarrow cdc$.
At this stage, $W = cdc$ and $w = cb^3a$ begins with $Z_{W_1} = c$, and $b^3a$ is $Z_{W_0}$-active but $X_{W_0}$ and $X_{W_1}$ do not have a common prefix, hence in lines 14-16, $v \leftarrow vcb^3$, $w \leftarrow{}$ReplacePrefix$(a, a) = a$ and $W \leftarrow ab^3a$.
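Although the trace is not yet complete at this point, one can already check by the same kind of brute-force search that with $v = ab^3adcb^3$ and $w = a$ the word $vw = ab^3adcb^3a$ is equal to $w_0 = cdab^3cdc$ in the monoid; three applications of the relation connect the two words. A self-contained sketch (the helper name is ours):

```python
# Brute-force check (not the paper's algorithm) that v·w = ab^3adcb^3·a and
# w0 = cdab^3cdc represent the same element of <a,b,c,d | ab^3a = cdc>.
from collections import deque

def equal_in_monoid(u, v, rel=("abbba", "cdc"), limit=10_000):
    """BFS over single applications of the relation, in either direction."""
    seen, queue = {u}, deque([u])
    while queue and len(seen) < limit:
        w = queue.popleft()
        if w == v:
            return True
        for lhs, rhs in (rel, rel[::-1]):
            i = w.find(lhs)
            while i != -1:
                x = w[:i] + rhs + w[i + len(lhs):]
                if x not in seen:
                    seen.add(x)
                    queue.append(x)
                i = w.find(lhs, i + 1)
    return False

print(equal_in_monoid("abbbadcbbb" + "a", "cdabbbcdc"))  # True
```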