Algorithmic properties of some fragments of concatenation theory

The paper considers two fragments of the word theory with concatenation. The first fragment has two relations denoting that one of the words is a prefix (respectively, a suffix) of another one. It is proved that this theory is algorithmically equivalent to elementary arithmetic and, therefore, undecidable. The second fragment has a countable set of power operations. It is proved that this theory admits effective quantifier elimination and is, therefore, decidable.


Introduction
The study of concatenation theory began in [1], where a second-order axiomatization of the theory of syntax based on concatenation was given. Since then concatenation theory and its variants have been studied in many papers. A detailed review of the history of this research can be found in [2]. In particular, in [3] it was proved that some finitely axiomatizable concatenation theory is undecidable for a two-symbol alphabet, and in [4] it was proved that it is essentially undecidable. In [2,5] it was established that some variant of Robinson arithmetic can be interpreted in this theory. At the same time it is easy to see that in some particular cases concatenation theory can be decidable. For example, if the alphabet Σ contains a single symbol a, then the set of all words Σ* is isomorphic to the set of natural numbers with addition, since a^x · a^y = a^(x+y). Therefore, the theory of this structure coincides with Presburger arithmetic and is, therefore, decidable. More difficult results on decidability and undecidability of different fragments of concatenation theory were obtained in [6,7]. In [8][9][10] some results were established on decidability of theories of structures whose universe is not the set of all words but some set of languages over some alphabet. In particular, it was proved that the theory of regular languages with concatenation only is undecidable for all alphabets.
In this paper we study the algorithmic complexity of two fragments of the word theory with concatenation. Section 2 contains the main definitions. In Section 3 we study the theory T_1 whose language contains only constant symbols and two predicate symbols Pref and Suf. The relations Pref(x, y) and Suf(x, y) mean that the word x is a prefix or a suffix of the word y, respectively. We prove that the theory of this structure is undecidable and equivalent to elementary arithmetic. In Section 4 we study the theory T_2 with the language containing power operations x^i = x · … · x (i times).

Undecidability of the word theory with prefix and suffix relations
In this section we study the theory T_1 of the structure with the universe Σ* for some alphabet Σ = {a_1, a_2, …, a_r}, r ≥ 2, and with the language Ω_1 = (Pref^(2), Suf^(2); a_1^(0), …, a_r^(0)), where
• Pref(x, y): x is a prefix of y;
• Suf(x, y): x is a suffix of y;
• a_i: the symbol a_i from the alphabet Σ.
First we prove definability of some additional operations and relations in the theory T_1. The relation Subw(x, y) means that the word x is a subword of the word y:
Subw(x, y) ≡ (∃u)(Pref(u, y) ∧ Suf(x, u)).
If x is a subword of y, then y = vxw for some v, w. Then we may choose u = vx, so the formula is true. If the formula is true, then y = uw and u = vx for some v, w. Then y = vxw, i.e. x is a subword of y.
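The definability argument for Subw can be sanity-checked by brute force on short words. The sketch below assumes that the defining formula is (∃u)(Pref(u, y) ∧ Suf(x, u)), which is the form the argument above suggests; the helper names (`words`, `subw_by_formula`, `check_subw`) are illustrative and not taken from the paper.

```python
from itertools import product


def words(alphabet, max_len):
    """Yield every word over `alphabet` of length at most `max_len`."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def pref(x, y):
    """Pref(x, y): x is a prefix of y."""
    return y.startswith(x)


def suf(x, y):
    """Suf(x, y): x is a suffix of y."""
    return y.endswith(x)


def subw_by_formula(x, y):
    # Assumed formula: (exists u)(Pref(u, y) and Suf(x, u)).
    # Any witness u is a prefix of y, so only the len(y) + 1 prefixes are tried.
    return any(suf(x, y[:k]) for k in range(len(y) + 1))


def check_subw(alphabet="ab", max_len=5):
    """Compare the formula with Python's substring test on all short words."""
    ws = list(words(alphabet, max_len))
    return all(subw_by_formula(x, y) == (x in y) for x in ws for y in ws)
```

A finite check of this kind does not replace the proof; it only guards against a mis-stated formula.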
The relation Subw_1(x, y) means that y has exactly one occurrence of x:
Subw_1(x, y) ≡ Subw(x, y) ∧ (∀u)(∀v)((Pref(u, y) ∧ Suf(x, u) ∧ Pref(v, y) ∧ Suf(x, v)) → u = v).
If y has exactly one occurrence of x, then the formula Subw(x, y) is true. Let us suppose that the second conjunct on the right-hand side of the equivalence is false, i.e. that there exist two different words u and v such that the formula Pref(u, y) ∧ Suf(x, u) ∧ Pref(v, y) ∧ Suf(x, v) is true. Then y = uz_1, y = vz_2, u = w_1x, v = w_2x for some z_1, z_2, w_1, w_2; therefore, y = w_1xz_1 and y = w_2xz_2. Since u ≠ v, we have z_1 ≠ z_2 and w_1 ≠ w_2. This means that there are at least two different occurrences of x in y, a contradiction.
Now let us suppose that the formula is true. Since Subw(x, y) is true, x is a subword of y. Let us suppose that there are at least two different occurrences of x in y. Then y = w_1xz_1 and y = w_2xz_2 for some words w_1 ≠ w_2, z_1, z_2. For u = w_1x and v = w_2x we have u ≠ v, while Pref(u, y) ∧ Suf(x, u) ∧ Pref(v, y) ∧ Suf(x, v) is true, so the whole formula is false; again a contradiction.
The main definable operation is concatenation of a word with an arbitrary fixed word. First we define concatenation of a word and a symbol, y = xa and y = ax, where a ∈ {a_1, …, a_r}:
y = xa ≡ x ≠ y ∧ Pref(x, y) ∧ Suf(a, y) ∧ (∀z)(Pref(z, y) → (z = y ∨ Pref(z, x))),
y = ax ≡ x ≠ y ∧ Suf(x, y) ∧ Pref(a, y) ∧ (∀z)(Suf(z, y) → (z = y ∨ Suf(z, x))).
We prove the correctness of the formula for the relation y = xa. The proof for y = ax is analogous. If y = xa, then obviously x ≠ y, x is a prefix of y, and a is a suffix of y. Let z be an arbitrary prefix of y. If z = y, then the conclusion of the implication is true. If z ≠ y, then z does not contain the final symbol a; therefore, z is a prefix of x. Now let us suppose that the formula on the right-hand side of the equivalence is true. Then y begins with x and ends with a. Therefore, y = xwa for some word w, since y ≠ x. If w ≠ ε, then let z be the word xw. The word z is a prefix of y, it is not a prefix of x, and z ≠ y. Therefore, the implication is false. This contradiction shows that w = ε and so y = xa.
Now we define concatenation with a word w, y = xw and y = wx, where w ∈ {a_1, …, a_r}*. We use induction on the length of w: y = xε is equivalent to y = x, the relation y = x(wa) is defined as (∃z)(z = xw ∧ y = za), and the relation y = (aw)x is defined as (∃z)(z = wx ∧ y = az). Let us emphasize that these operations are defined only for the case when the word w is fixed. This construction does not allow one to express concatenation of two variables.
Now we show how to describe computations of an arbitrary deterministic Turing machine using formulas of the language Ω_1. We write the configurations of the Turing machine M = (Q, ∆, P, q_0, q_f) as Post words, i.e. as words of the form #uqav#, where q ∈ Q is the current state, a ∈ ∆ is the observed symbol, # ∉ Q ∪ ∆ is a special symbol (a marker of the used part of the tape), and u, v ∈ ∆* are the words written on the tape to the left and to the right of the observed cell. Here u does not begin with the empty symbol Λ, and v does not end with Λ. The relation α ⊢_M β means that the machine M moves in one step from the configuration α to the configuration β.
It is known (see [11]) that for every Turing machine M = (Q, ∆, P, q_0, q_f) one can effectively construct a semi-Thue system T = (Q ∪ ∆ ∪ {#}, U) such that for every two Post words α and β the following property holds: α ⊢_M^n β if and only if α ⇒_T^n β. Moreover, both the production used and the position of its application are uniquely determined at every step of the derivation α ⇒_T^n β. In the proof of the following lemma we assume that the semi-Thue system is fixed.
Lemma 1. Let the relation Der(x, y) mean that x ⇒_T^* y for the semi-Thue system T. Then Der is definable in the theory T_1.
Proof. We assume that the alphabet Σ contains all the symbols of the semi-Thue system and a special symbol $, i.e. that Q ∪ ∆ ∪ {#, $} ⊆ Σ. Later, in the proof of the main theorem, we will show how to encode the computations using only two symbols. The derivation α_1 ⇒_T α_2 ⇒_T … ⇒_T α_m is encoded by the word $α_1$α_2$…$α_m$.
Let us define some relations which describe derivability in the semi-Thue system. The relation Step_{α→β}(x, y) means that x ⇒_T y by using a production α → β, x has exactly one occurrence of α, and y has exactly one occurrence of β:
Step_{α→β}(x, y) ≡ Subw_1(α, x) ∧ Subw_1(β, y) ∧ (∃u)(∃v)(Pref(uα, x) ∧ Suf(αv, x) ∧ Pref(uβ, y) ∧ Suf(βv, y)).
Here uα, αv, uβ and βv are concatenations with fixed words, which are definable as shown above. The formulas Subw_1(α, x) and Subw_1(β, y) express that the occurrences of α and β are unique. If x ⇒_T y, then x = wαz and y = wβz for some w and z. Then we may choose u = w, v = z. Conversely, if the formula is true, then x begins with the word uα and ends with the word αv. Since the occurrence of α in x is unique, this means that x = uαv. Similarly, y = uβv. Therefore, x ⇒_T y.
The relation Consec(x, y, z) means that x and y are two consecutive words over Σ in z:
Consec(x, y, z) ≡ (∃u)(Subw($u$, z) ∧ Subw_1($, u) ∧ Pref(x$, u) ∧ Suf($y, u)).
Note that this relation does not verify whether x ⇒_T y. It only checks that y immediately follows x.
If the relation is true, then z = v$x$y$w for some words v and w. We may choose u = x$y. This word occurs in z between two symbols $, contains exactly one symbol $, begins with x$, and ends with $y. Now let the formula be true. Then x$ is a prefix of u, and $y is a suffix of u. Moreover, since Subw_1($, u) holds, the symbol $ that ends x$ and the symbol $ that begins $y are one and the same occurrence. This means that u = x$y, and also that neither x nor y contains the separator $. Then z has a subword $x$y$; therefore, x and y are two consecutive words over Σ.
The relation Der_1(x) means that x encodes a derivation in the semi-Thue system:
Der_1(x) ≡ Pref($, x) ∧ Suf($, x) ∧ x ≠ $ ∧ (∀u)(∀v)(Consec(u, v, x) → ⋁_{(α→β) ∈ U} Step_{α→β}(u, v)).
If x encodes a derivation, then by definition x begins and ends with $. Moreover, x ≠ $ since m ≥ 1. Let u and v be two arbitrary words. If one of them is not a subword of x or contains $, then the implication is true since Consec(u, v, x) is false. If they are not consecutive configurations, then the implication is also true. Finally, if u = α_i and v = α_{i+1}, then Step_{α→β}(u, v) is true for some production α → β. In all cases the formula is true. Conversely, let the formula be true. Then x ≠ $, and x begins and ends with $. This means that x can be written as x = $α_1$α_2$…$α_m$ where none of the words α_i contains $. For every i the formula Consec(α_i, α_{i+1}, x) is true; therefore, the conclusion of the implication is also true, and α_i ⇒_T α_{i+1} for some production.
The main relation Der(x, y) is defined as follows:
Der(x, y) ≡ ¬Subw($, x) ∧ ¬Subw($, y) ∧ (∃z)(Der_1(z) ∧ Pref($x$, z) ∧ Suf($y$, z)).
If x ⇒_T^* y, then there exists a derivation which is encoded by the word z = $α_1$α_2$…$α_m$ where α_1 = x, α_m = y. This word begins with $x$ and ends with $y$. Moreover, neither x nor y contains the symbol $. This means that the formula is true. Now let the formula be true. Then x and y do not contain $, and there exists a word z encoding some derivation. If z contains exactly two symbols $, then z = $x$ = $y$; therefore, x = y and x ⇒_T^* y. If z contains more than two symbols $, then z = $α_1$α_2$…$α_m$ where α_1 = x, α_m = y. In this case also x ⇒_T^* y, because the formula Der_1(z) is true.
Now we prove a technical lemma which will be used later to encode natural numbers.
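The encoding of derivations as words $α_1$α_2$…$α_m$ can be illustrated with a small sketch that checks the intended meaning of Der_1 and Der directly, without going through the formulas. The function names and the sample production below are hypothetical; unlike the relation Step, this checker accepts a rewriting at any occurrence of the left-hand side and does not require the occurrence to be unique.

```python
def one_step(x, y, productions):
    """True if x => y by one application of some production (lhs, rhs)."""
    for lhs, rhs in productions:
        start = x.find(lhs)
        while start != -1:
            if x[:start] + rhs + x[start + len(lhs):] == y:
                return True
            start = x.find(lhs, start + 1)
    return False


def encode(derivation, sep="$"):
    """Encode the list [alpha_1, ..., alpha_m] as $alpha_1$...$alpha_m$."""
    return sep + sep.join(derivation) + sep


def is_derivation_code(z, productions, sep="$"):
    """Intended meaning of Der_1(z): z encodes a derivation."""
    if z == sep or not (z.startswith(sep) and z.endswith(sep)):
        return False
    blocks = z[1:-1].split(sep)  # the words between consecutive separators
    return all(one_step(u, v, productions) for u, v in zip(blocks, blocks[1:]))


def der_witness(x, y, z, productions, sep="$"):
    """True if z witnesses Der(x, y): a derivation code from x to y."""
    if sep in x or sep in y:
        return False
    return (is_derivation_code(z, productions, sep)
            and z.startswith(sep + x + sep)
            and z.endswith(sep + y + sep))
```

For example, with the single hypothetical production ab → ba, the word `encode(["aab", "aba", "baa"])` is `$aab$aba$baa$`, and it witnesses that baa is derivable from aab.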
Lemma 2. Let the relation Num(x) mean that x = (bab)^k for some natural number k > 0. Then Num is definable in the theory T_1.
Proof. The relation Num(x) is defined as follows:
Num(x) ≡ Pref(b, x) ∧ Suf(b, x) ∧ Subw(a, x) ∧ ¬Pref(bb, x) ∧ ¬Suf(bb, x) ∧ ¬Subw(aa, x) ∧ ¬Subw(aba, x) ∧ ¬Subw(bbb, x).
It is straightforward to check that the words of the form babbab…bab satisfy the formula. Now let the formula be true. The word x cannot be empty because it begins with b. We prove the following statement by induction on n: if |x| > 3n for some natural number n, then x begins with the prefix (bab)^(n+1).
Base case. Let n = 0. Due to Pref(b, x) the word x begins with b. The case x = b is impossible since x contains a due to Subw(a, x). But the second symbol can be only a because otherwise x would start with bb which contradicts ¬Pref(bb, x). Thus, x begins with ba. The case x = ba is also impossible since x can only end with b due to Suf(b, x). The third symbol can be only b because x does not contain aa as a subword due to ¬Subw(aa, x). Thus, x necessarily begins with bab.
Induction step. Let |x| > 3(n + 1). By the induction hypothesis x begins with the prefix y = (bab)^(n+1). The next symbol can be only b because otherwise x would contain the subword aba, contradicting ¬Subw(aba, x). This symbol b cannot be the last one because ¬Suf(bb, x) is true. The next symbol cannot be b since the word bbb has no occurrences in x due to ¬Subw(bbb, x). Therefore, x begins with yba. Since both formulas Suf(b, x) and ¬Subw(aa, x) are true, the next symbol is b. Thus, x begins with ybab = (bab)^(n+2).
Now let x be some word for which the formula is true, and let 3n < |x| ≤ 3(n + 1). Then x = (bab)^(n+1)y for some y. Since |x| ≤ 3(n + 1), we have y = ε and x = (bab)^(n+1).
Now we establish our main result on the algorithmic complexity of T_1.
Theorem 1. The theory T_1 is algorithmically equivalent to elementary arithmetic and is, therefore, undecidable.
Proof. We need to define addition and multiplication of natural numbers in the theory T_1. Let M_+ = (Q, ∆, P, q_1, q_f) be a Turing machine which adds numbers written in the unary system. We may assume that the number p is encoded by the word |^(p+1) in order to avoid using the empty word when p = 0. Let T_+ be the semi-Thue system corresponding to the machine M_+. Since we have only two symbols a and b, we use the standard encoding (here n = |∆| and m = |Q|):
• b_i is encoded as ba^i b;
• q_j is encoded as ba^(n+j) b;
• # is encoded as ba^(m+n+1) b;
• $ is encoded as ba^(m+n+2) b.
We may assume that b_1 = |, b_2 = Λ, q_2 = q_f. Since the symbol | is encoded by the word bab, the natural number k is encoded as (bab)^(k+1). In particular, the number 0 is represented by the word bab. By Lemma 2 the relation Num(x) verifies whether the word x is a code of some natural number. Now we can define the relation Add(x, y, z), which means that the word z represents the sum of the two numbers encoded by the words x and y. In the subformula Der of its definition we replace every symbol from the alphabet Q ∪ ∆ ∪ {#, $} with its code.
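The number encoding and the test Num from Lemma 2 can be sanity-checked by brute force. The sketch below assumes that Num is the conjunction of exactly the eight conditions used in the proof of Lemma 2 (Pref(b, x), Suf(b, x), Subw(a, x), ¬Pref(bb, x), ¬Suf(bb, x), ¬Subw(aa, x), ¬Subw(aba, x), ¬Subw(bbb, x)) and that the alphabet is {a, b}; all function names are illustrative.

```python
from itertools import product


def num_by_formula(x):
    """Assumed conjunction of the conditions used in the proof of Lemma 2."""
    return (x.startswith("b") and x.endswith("b") and "a" in x
            and not x.startswith("bb") and not x.endswith("bb")
            and "aa" not in x and "aba" not in x and "bbb" not in x)


def is_bab_power(x):
    """True if x = (bab)^k for some k > 0."""
    k, r = divmod(len(x), 3)
    return k > 0 and r == 0 and x == "bab" * k


def encode_number(k):
    """The natural number k is represented by the word (bab)^(k+1)."""
    return "bab" * (k + 1)


def decode_number(x):
    """Inverse of encode_number; rejects words that are not codes."""
    if not is_bab_power(x):
        raise ValueError("not a code of a natural number")
    return len(x) // 3 - 1


def check_num(max_len=12):
    """Compare the formula with the intended meaning on all short words."""
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            x = "".join(letters)
            if num_by_formula(x) != is_bab_power(x):
                return False
    return True
```

The check only covers words up to a fixed length over a two-letter alphabet, so it supports the lemma rather than proving it.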
In an analogous way we can define the relation Mult(x, y, z) which means that the word z represents the product of two numbers encoded by the words x and y. The only difference is that the formula Der is constructed using a Turing machine for multiplication.
Let ϕ be an arbitrary formula of arithmetic. We may assume that all its atomic subformulas are of the form x = y, x + y = z, or x × y = z. For the formula ϕ we define its translation T(ϕ) into the language Ω_1. It is easy to see that the formula ϕ is true in arithmetic if and only if T(ϕ) ∈ T_1.
In order to prove the converse, we fix some "natural" enumeration of all words over the alphabet Σ. For example, we may order all the words by their lengths, with the words of the same length ordered lexicographically. Let w_k be the word with the number k. Then the relations Pref(w_x, w_y) and Suf(w_x, w_y) are computable, and, consequently, they are representable in arithmetic (see [12]). Therefore, for every formula ϕ of the language Ω_1 we can construct an arithmetical formula ψ such that ψ is true in arithmetic if and only if ϕ ∈ T_1.
Now we strengthen Theorem 1. We prove that the theory T_1 remains undecidable even if no constant symbols are available.
Proof. The empty word ε is definable in the theory T_1:
x = ε ≡ (∀y)(Pref(y, x) → y = x).
The correctness of this definition follows from the fact that ε is the only word which has exactly one prefix, namely itself.
The relation Symb(x), meaning that x is a symbol, is also definable:
Symb(x) ≡ x ≠ ε ∧ (∀y)(Pref(y, x) → (y = ε ∨ y = x)).
Every symbol is nonempty and has exactly two prefixes: the empty word and itself. Conversely, if |x| ≥ 2, then x = abz for some symbols a, b and some word z. Therefore, the formula is false for y = a.
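The two constant-free definitions can be checked on short words. The sketch assumes the formulas x = ε ≡ (∀y)(Pref(y, x) → y = x) and Symb(x) ≡ x ≠ ε ∧ (∀y)(Pref(y, x) → (y = ε ∨ y = x)), which are the forms suggested by the justifications above; quantifiers over y are restricted to prefixes of x, since other values of y satisfy the implications trivially. The helper names are illustrative.

```python
from itertools import product


def words(alphabet, max_len):
    """Yield every word over `alphabet` of length at most `max_len`."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def prefixes(x):
    """All prefixes of x, from the empty word up to x itself."""
    return [x[:k] for k in range(len(x) + 1)]


def is_empty_by_formula(x):
    # Assumed formula: (forall y)(Pref(y, x) -> y = x).
    return all(y == x for y in prefixes(x))


def is_symbol_by_formula(x):
    # Assumed formula: x != eps and (forall y)(Pref(y, x) -> (y = eps or y = x)).
    return (not is_empty_by_formula(x)
            and all(is_empty_by_formula(y) or y == x for y in prefixes(x)))


def check_definitions(alphabet="ab", max_len=6):
    """Compare both formulas with their intended meaning on all short words."""
    return all(
        is_empty_by_formula(x) == (x == "")
        and is_symbol_by_formula(x) == (len(x) == 1)
        for x in words(alphabet, max_len)
    )
```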
Let ϕ be a formula of the language Ω_1 where two constant symbols a and b are used. We replace them with two variables, also denoted a and b, and we construct from ϕ a formula ψ without constant symbols. Then ϕ ∈ T_1 if and only if ψ ∈ T_1.
Let us note that if the language contains only one of the predicate symbols Pref or Suf, then the structure is automatic. Therefore, its theory is decidable (see [13,14]).
Corollary 1. Let x^(−1) denote the inverse of the word x: (a_1 a_2 … a_n)^(−1) = a_n … a_2 a_1. Then the theory T_1 of the structure with the universe Σ* (|Σ| ≥ 2) and the language Ω_1 = (Pref^(2); x^(−1) of arity 1) is equivalent to elementary arithmetic.

Decidability of the word theory with power operations
In this section we study the word theory with power operations x^i = x · … · x (i times). We expand the language with the constant symbol ε denoting the empty word. We also add the predicate symbols R_i, i = 2, 3, …, where R_i(x) means that x is the i-th power of some word. Let T_2 be the theory of the structure with the universe Σ* for some alphabet Σ = {a_1, a_2, …, a_r}, r ≥ 1, and with the language Ω_2 = (R_i^(1); x^i of arity 1, ε^(0)), i = 2, 3, …. Note that the symbols ε and R_i are definable in the original language: R_i(x) holds if and only if (∃y)(x = y^i), and x = ε if and only if x^2 = x. In the proof of decidability of T_2 we will use the following well-known result (see [15,16]).
Theorem 3 (Lyndon-Schützenberger theorem). Let x^i = y^j for some words x, y and some numbers i, j. Then x = z^k, y = z^l for some word z and some numbers k, l.
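The statement of Theorem 3 can be illustrated on short words. The sketch below restricts attention to nonempty words and exponents i, j ≥ 1 (an assumption made here for simplicity) and uses the primitive root of a word, i.e. the shortest z with w = z^k, as the candidate common root; the function names are illustrative.

```python
from itertools import product


def words(alphabet, max_len):
    """Yield every word over `alphabet` of length at most `max_len`."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def primitive_root(w):
    """Shortest z such that w == z * k for some k >= 1 (w must be nonempty)."""
    n = len(w)
    for d in range(1, n + 1):
        if n % d == 0 and w[:d] * (n // d) == w:
            return w[:d]


def common_root(x, y):
    """A word z with x and y both powers of z, or None if the roots differ."""
    zx, zy = primitive_root(x), primitive_root(y)
    return zx if zx == zy else None


def check_common_root(alphabet="ab", max_len=4, max_exp=4):
    """Whenever x^i == y^j on short words, a common root must exist."""
    ws = [w for w in words(alphabet, max_len) if w]
    for x in ws:
        for y in ws:
            for i in range(1, max_exp + 1):
                for j in range(1, max_exp + 1):
                    if x * i == y * j and common_root(x, y) is None:
                        return False
    return True
```

For instance, (abab)^3 = (ababab)^2, and both words are powers of ab.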
First we establish a simple property of the predicates R_i.
Lemma 3. The following equivalence holds in the theory T_2 (here lcm denotes the least common multiple):
R_{b_1}(x) ∧ … ∧ R_{b_m}(x) ≡_{T_2} R_{lcm(b_1, …, b_m)}(x).
Proof. First we consider the case m = 2. If the formula R_{lcm(b_1, b_2)}(x) is true, then x = y^lcm(b_1, b_2) for some word y. Therefore, x = (y^(lcm(b_1, b_2)/b_1))^(b_1) = (y^(lcm(b_1, b_2)/b_2))^(b_2), and, consequently, the formulas R_{b_1}(x) and R_{b_2}(x) are true. Now let the formulas R_{b_1}(x) and R_{b_2}(x) be true, i.e. x = y^(b_1) = z^(b_2) for some words y and z. If x = ε, then R_{lcm(b_1, b_2)}(x) is trivially true, so let x ≠ ε. It follows from the Lyndon-Schützenberger theorem that there exists a word v such that y = v^i, z = v^j for some i and j. Therefore, x = v^(i·b_1) = v^(j·b_2), and since v ≠ ε, we have i·b_1 = j·b_2. It follows from this equality that i·b_1 is divisible by both b_1 and b_2. So i·b_1 is divisible by lcm(b_1, b_2), and, consequently, the formula R_{lcm(b_1, b_2)}(x) is true.
For an arbitrary m the equivalence is proved by induction on m using the equality lcm(lcm(a, b), c) = lcm(a, b, c).
Now we prove that the theory T_2 is decidable.
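Lemma 3 for m = 2 can be checked by brute force on short words. In the sketch below, `is_power(x, b)` is a hypothetical helper implementing the intended meaning of R_b(x), namely that x = y^b for some word y; the bounds on word length and exponents are arbitrary choices made for the check.

```python
from itertools import product
from math import gcd


def words(alphabet, max_len):
    """Yield every word over `alphabet` of length at most `max_len`."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def is_power(x, b):
    """Intended meaning of R_b(x): x == y * b for some word y."""
    if len(x) % b != 0:
        return False
    d = len(x) // b
    return x[:d] * b == x


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def check_lemma3(alphabet="ab", max_len=12, max_b=6):
    """R_{b1}(x) and R_{b2}(x) should hold exactly when R_{lcm(b1, b2)}(x) does."""
    for x in words(alphabet, max_len):
        for b1 in range(2, max_b + 1):
            for b2 in range(2, max_b + 1):
                both = is_power(x, b1) and is_power(x, b2)
                if both != is_power(x, lcm(b1, b2)):
                    return False
    return True
```

The empty word satisfies every R_b, which is why the check treats it like any other word.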
Theorem 4. The theory T_2 admits effective quantifier elimination and is, therefore, decidable.
Proof. It is sufficient to show how to eliminate the quantifier from a formula of the form (∃x)θ where θ is a conjunction of atomic formulas and their negations (see [12]). Let θ be such a conjunction in which the terms s_i and t_j do not contain the variable x. We may assume that the formula has no subformulas of the form x^p = x^q, since they are equivalent either to ⊤ when p = q, or to x = ε when p ≠ q (here the symbol ⊤ denotes a formula that is always true). It is easy to see that for every k ≥ 1
x = y ≡_{T_2} x^k = y^k, R_b(x) ≡_{T_2} R_{kb}(x^k).
Let N be the least common multiple of all the numbers p_i, q_i, p'_i, q'_i. Then, by the equivalences above, θ can be rewritten as an equivalent formula ϕ for suitable s_i, t_i, b_i, c_i. Now the exponent of x is the same for all occurrences of x in ϕ.
Let us first suppose that the formula ϕ contains equalities. Then

It remains to show how to eliminate the quantifier from a formula without equalities. Let

Let B be the least common multiple of the numbers b_1, …, b_m. Then by Lemma 3