Symbolic tree automata

https://doi.org/10.1016/j.ipl.2014.11.005

Highlights

  • Classical tree automata are extended to work modulo infinite alphabet theories.

  • Most important classical properties effectively generalize to the symbolic case.

  • The key results are effective closure under complement and intersection.

  • The results are relevant in program analysis using state-of-the-art SMT solvers.

Abstract

We introduce symbolic tree automata as a generalization of finite tree automata with a parametric alphabet over any given background theory. We show that symbolic tree automata are closed under Boolean operations, and that the operations are effectively uniform in the given alphabet theory. This generalizes the corresponding classical properties known for finite tree automata.

Introduction

Finite word automata and finite tree automata provide a foundation for a wide range of applications in software engineering, from regular expressions to compiler technology and specification languages. Despite their immense practical use, explicit representations are not feasible in the presence of large finite alphabets, because they require each transition to encode only a single element of the alphabet. For example, string characters in standard programming languages (such as the char type in C#) are 16-bit bit-vectors, so an explicit representation would require an alphabet of size $2^{16}$. Moreover, most common forms of finite automata do not support infinite alphabets.
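To make the size argument concrete, the following minimal Python sketch counts how many explicit transitions a single guard would expand to over a 16-bit alphabet. The guard `is_digit` and the code-point range it tests are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch: explicit versus symbolic transition labels over a
# 16-bit alphabet. The guard below is a hypothetical example.
ALPHABET_SIZE = 2 ** 16  # number of distinct 16-bit characters


def is_digit(c: int) -> bool:
    """Hypothetical guard: code points of the ASCII digits '0'..'9'."""
    return 0x30 <= c <= 0x39


# An explicit automaton needs one transition per character satisfying the
# guard; a symbolic automaton keeps a single transition labeled by the guard.
explicit_digit = sum(1 for c in range(ALPHABET_SIZE) if is_digit(c))
explicit_non_digit = ALPHABET_SIZE - explicit_digit

print(ALPHABET_SIZE, explicit_digit, explicit_non_digit)
```

The complement of a small guard is the expensive case for an explicit representation: one symbolic transition stands for almost the entire alphabet.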

Symbolic tree automata are a practical solution to the representation problem. They extend classical tree automata by allowing transitions to be labeled with arbitrary formulas in a specified label theory. While the idea of allowing formulas is straightforward, typical extensions of finite tree automata often lead either to undecidability of the emptiness problem, as for tree automata with equality and disequality constraints [1], or to nonclosure under complement, as for the generalized tree set automata class [1], finite-memory tree automata [2], which generalize finite-memory automata [3] to trees, or unranked data tree automata [4]. We show that this is not the case for symbolic tree automata. The key distinction is that the extension here is with respect to characters rather than adding symbolic states or adding constraints over whole subtrees.

The symbolic extension is practically useful for exploiting efficient symbolic constraint solvers when performing basic automata-theoretic transformations: it enables a separation of concerns. The solver is used as a black box with a clearly defined interface that exposes the label theory as an effective Boolean algebra. The chosen label theory can be specific to a particular problem instance. For example, even when the alphabet is finite, e.g., 16-bit bit-vectors, it may be useful for efficiency reasons to use integer-linear arithmetic rather than bit-vector arithmetic when the solver is more efficient over integers and when only standard arithmetic operations (and no bit-level operations) are being used. Recent work [5], [6] on symbolic string recognizers and transducers takes advantage of this observation.
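The black-box interface can be pictured with a small stand-in for a solver. The sketch below is only an illustration under stated assumptions: the universe is the 16-bit characters, predicates are unions of integer ranges, and the names `Pred`, `rng`, `p_and`, `p_or`, `p_not` and `is_sat` are hypothetical, not an API from the paper or from any SMT solver.

```python
from dataclasses import dataclass

MAX_CHAR = 0xFFFF  # assumed universe: 16-bit characters 0..0xFFFF


@dataclass(frozen=True)
class Pred:
    """Predicate denoting a union of sorted, disjoint, non-adjacent closed ranges."""
    ranges: tuple = ()


def _normalize(ranges):
    """Sort the ranges and merge overlapping or adjacent ones."""
    merged = []
    for lo, hi in sorted(r for r in ranges if r[0] <= r[1]):
        if merged and lo <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return Pred(tuple(merged))


TOP = Pred(((0, MAX_CHAR),))  # denotes the whole universe
BOT = Pred(())                # denotes the empty set


def rng(lo, hi):
    """Predicate for the closed range lo..hi."""
    return _normalize([(lo, hi)])


def p_or(a, b):
    return _normalize(a.ranges + b.ranges)


def p_not(a):
    """Complement relative to 0..MAX_CHAR; relies on normalized input."""
    out, nxt = [], 0
    for lo, hi in a.ranges:
        if nxt < lo:
            out.append((nxt, lo - 1))
        nxt = hi + 1
    if nxt <= MAX_CHAR:
        out.append((nxt, MAX_CHAR))
    return Pred(tuple(out))


def p_and(a, b):
    # De Morgan: a AND b = NOT (NOT a OR NOT b)
    return p_not(p_or(p_not(a), p_not(b)))


def is_sat(a):
    """Satisfiability check: does the predicate denote a nonempty set?"""
    return bool(a.ranges)


digit, lower = rng(0x30, 0x39), rng(0x61, 0x7A)
print(is_sat(p_and(digit, lower)), is_sat(p_and(digit, p_not(lower))))
```

Automata algorithms written against these five operations never inspect how a predicate is represented, so the range-based stand-in could be swapped for a solver-backed theory without changing them.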

We here investigate the case of the more expressive class of symbolic tree automata. Even though a symbolic tree automaton is a finite object, a key point is that the number of interpretations for symbolic labels does not need to be finite. For example, as a consequence of our main result (Theorem 2) a label theory may itself be the theory of symbolic tree automata (over some basic label theory).

In order to use classical tree automata algorithms, it is possible to reduce a symbolic tree automaton A into a classical finite tree automaton whose alphabet is given by all of the satisfiable Boolean combinations of guards that occur in A. However, such a transformation is in general not practical because it introduces an exponential increase in the size of the automaton before the actual algorithm is applied. Moreover, when more than one automaton is involved, the reduction has to be done up front for all predicates that occur in all the automata in order to define the common alphabet. A concrete example of such a blowup is given in [7, Example 2].
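The source of the blowup can be seen by enumerating the satisfiable Boolean combinations directly. The following sketch is an illustration only: guards are modeled as Python sets over a small finite universe (an assumption made so that satisfiability is just non-emptiness), and the function name `minterms` is hypothetical.

```python
from itertools import product


def minterms(universe, guards):
    """Return the satisfiable combinations that pick each guard or its complement.

    Each result pairs the tuple of sign choices with the set it denotes.
    """
    cells = []
    for signs in product([True, False], repeat=len(guards)):
        cell = set(universe)
        for guard, positive in zip(guards, signs):
            cell &= guard if positive else (universe - guard)
        if cell:  # keep only satisfiable combinations
            cells.append((signs, frozenset(cell)))
    return cells


universe = frozenset(range(16))
# Four independent guards: "bit i of the label is set", for i = 0..3.
bit_guards = [frozenset(x for x in universe if (x >> i) & 1) for i in range(4)]

print(len(minterms(universe, bit_guards[:2])), len(minterms(universe, bit_guards)))
```

With n independent guards, all 2^n combinations are satisfiable, so the derived classical alphabet grows exponentially in the number of guards even though the symbolic automaton mentions only n predicates.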

Section snippets

Definition of symbolic tree automata

We introduce an extension of tree automata with an effective encoding of labels by predicates that denote sets of labels, rather than individual labels. We assume a countable background universe $B$. A predicate $\varphi$ over $B$ is a finite representation of a subset $[\![\varphi]\!]_B$ of $B$; we write $[\![\varphi]\!]$ when $B$ is clear from the context. We assume given an effectively enumerable set of predicates $\Sigma$ such that, for each element $a \in B$ there is $\hat{a} \in \Sigma$ such that $[\![\hat{a}]\!] = \{a\}$, $\top, \bot \in \Sigma$ such that $[\![\top]\!] = B$ and $[\![\bot]\!] = \emptyset$, and $\Sigma$ is effectively

Determinization of symbolic tree automata

Similar to the case of deterministic frontier-to-root tree recognizers, DSTAs have the same expressive power as general STAs. We lift the classical powerset construction of nondeterministic Rabin-Scott recognizers to STAs. Let $\mathcal{P}(X)$ denote the powerset of a set $X$. We write $p \xrightarrow{\varphi} \bar{q}$ for the rule $(p, \varphi, \bar{q})$.

Definition 5

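Let $A = (\Sigma, Q, Q^L, Q^R, \Delta)$. The powerset STA of $A$ is:
$$\mathcal{P}(A) \stackrel{\text{def}}{=} (\Sigma, \mathcal{P}(Q), \{Q^L\}, \{q \in \mathcal{P}(Q) \mid q \cap Q^R \neq \emptyset\}, \{\mathit{lhs}(S) \xrightarrow{\mu(S, \Delta(\bar{q}) \setminus S)} \bar{q} \mid \bar{q} \in \mathcal{P}(Q) \times \mathcal{P}(Q), S \subseteq \Delta(\bar{q})\})$$
where
$$\Delta(q_1, q_2) \stackrel{\text{def}}{=} \{\rho \mid \rho \in \Delta, \mathit{rhs}(\rho) \in q_1 \times q_2\}$$
$\mu(S, S') =$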
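A minimal Python sketch of the powerset construction in Definition 5 is given below, under several assumptions that are not confirmed by the excerpt: trees are binary, with `None` as the leaf and `(label, left, right)` as an inner node; $Q^L$ and $Q^R$ are read as leaf and root (accepting) states of a bottom-up run; guards are Python callables over integer labels; and, because the definition of μ is cut off here, `mu` is assumed to be the conjunction of the guards in its first argument with the negations of the guards in its second. All function names are hypothetical.

```python
from itertools import combinations, product


def powerset(xs):
    xs = list(xs)
    return [frozenset(c) for k in range(len(xs) + 1) for c in combinations(xs, k)]


def mu(pos, neg):
    """Assumed guard: every guard in `pos` holds and no guard in `neg` holds."""
    return lambda a: all(g(a) for g in pos) and not any(g(a) for g in neg)


def determinize(A):
    """Powerset STA of A = (states, leaf, root, rules), rules being (p, guard, q, r)."""
    states, leaf, root, rules = A
    P = powerset(states)
    new_rules = []
    for q1, q2 in product(P, repeat=2):
        # Rules of A whose child states lie in q1 x q2.
        enabled = [rho for rho in rules if rho[2] in q1 and rho[3] in q2]
        for idx in powerset(range(len(enabled))):
            S = [enabled[i] for i in idx]
            rest = [enabled[i] for i in range(len(enabled)) if i not in idx]
            lhs = frozenset(rho[0] for rho in S)
            guard = mu([rho[1] for rho in S], [rho[1] for rho in rest])
            new_rules.append((lhs, guard, q1, q2))
    new_root = {q for q in P if q & set(root)}
    return P, {frozenset(leaf)}, new_root, new_rules


def states_of(tree, A):
    """Set of states reachable at the root of `tree` in a bottom-up run."""
    _, leaf, _, rules = A
    if tree is None:
        return set(leaf)
    label, left, right = tree
    ls, rs = states_of(left, A), states_of(right, A)
    return {p for (p, g, q, r) in rules if g(label) and q in ls and r in rs}


def accepts(tree, A):
    return bool(states_of(tree, A) & set(A[2]))


# Example STA: accepts trees that contain at least one negative label.
true = lambda a: True
is_neg = lambda a: a < 0
A = ({"u", "n"}, {"u"}, {"n"},
     [("u", true, "u", "u"), ("n", is_neg, "u", "u"),
      ("n", true, "n", "u"), ("n", true, "u", "n")])
D = determinize(A)

t_neg = (1, (-3, None, None), None)
t_pos = (1, None, None)
print(accepts(t_neg, A), accepts(t_neg, D), accepts(t_pos, A), accepts(t_pos, D))
```

The sketch keeps every subset S, including those whose guard is unsatisfiable; an implementation backed by a solver would presumably prune such rules with a satisfiability check. For every pair of child states and every label exactly one of the generated guards holds, which is what makes the result deterministic.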

Boolean closure of symbolic tree automata

For complete closure under Boolean operations we use the following product construction that is a lifting of the standard product of finite tree automata to STAs.

Definition 6

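Let $A_i = (\Sigma, Q_i, Q_i^L, Q_i^R, \Delta_i)$, for $i = 1, 2$, be STAs. The product of $A_1$ and $A_2$ is the STA
$$A_1 \times A_2 \stackrel{\text{def}}{=} (\Sigma, Q_1 \times Q_2, Q_1^L \times Q_2^L, Q_1^R \times Q_2^R, \{\rho_1 \times \rho_2 \mid \rho_1 \in \Delta_1, \rho_2 \in \Delta_2\})$$
where, for $i \in \{1, 2\}$ and $\rho_i = (p_i, \varphi_i, q_i, r_i) \in \Delta_i$,
$$\rho_1 \times \rho_2 \stackrel{\text{def}}{=} ((p_1, p_2), \varphi_1 \wedge \varphi_2, (q_1, q_2), (r_1, r_2))$$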

Lemma 4 implies that we can effectively intersect languages of STAs. The proof of (a) follows by induction over trees.
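The product of Definition 6 can be sketched in a few lines of Python. The encoding is an assumption made for illustration: trees are binary with `None` as the leaf and `(label, left, right)` as an inner node, $Q^L$ and $Q^R$ are read as leaf and root states of a bottom-up run, guards are Python callables over integer labels, and all names are hypothetical.

```python
def conj(f, g):
    """Conjunction of two guards, mirroring the guard of a product rule."""
    return lambda a: f(a) and g(a)


def product_sta(A1, A2):
    """Product of two STAs given as (states, leaf, root, rules) tuples."""
    s1, l1, f1, d1 = A1
    s2, l2, f2, d2 = A2
    states = {(p, q) for p in s1 for q in s2}
    leaf = {(p, q) for p in l1 for q in l2}
    root = {(p, q) for p in f1 for q in f2}
    rules = [((p1, p2), conj(g1, g2), (q1, q2), (r1, r2))
             for (p1, g1, q1, r1) in d1
             for (p2, g2, q2, r2) in d2]
    return states, leaf, root, rules


def states_of(tree, A):
    """Set of states reachable at the root of `tree` in a bottom-up run."""
    _, leaf, _, rules = A
    if tree is None:
        return set(leaf)
    label, left, right = tree
    ls, rs = states_of(left, A), states_of(right, A)
    return {p for (p, g, q, r) in rules if g(label) and q in ls and r in rs}


def accepts(tree, A):
    return bool(states_of(tree, A) & set(A[2]))


# Two single-state STAs over the integers, an infinite label alphabet.
all_even = ({"e"}, {"e"}, {"e"}, [("e", lambda a: a % 2 == 0, "e", "e")])
all_pos = ({"p"}, {"p"}, {"p"}, [("p", lambda a: a > 0, "p", "p")])
both = product_sta(all_even, all_pos)

t1 = (2, (4, None, None), None)    # every label is even and positive
t2 = (2, (-4, None, None), None)   # every label is even, one is negative
print(accepts(t1, both), accepts(t2, both))
```

On these two sample trees the product accepts exactly when both component automata accept. The helper `conj` builds each conjoined guard through a function call so that every product rule captures its own pair of guards.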

Related work

Our interest in automata and transducers with symbolic alphabets originally surfaced in the context of security analysis of string sanitization routines [6]. Sanitizers transform untrusted data to trusted data as a first line of defense against cross-site scripting (XSS) attacks in web browsers. Symbolic transducers were generalized to symbolic tree transducers (STTs) in [12]. Boolean closure operations of STAs were initially studied in [13] where preliminary results corresponding to Theorem 1

References (25)

  • H. Comon et al.

    Tree automata techniques and applications

  • M. Kaminski et al.

    Tree automata over infinite alphabets

  • M. Kaminski et al.

    Finite-memory automata

  • C. David et al.

    Efficient reasoning about data trees via integer linear programming

  • M. Veanes et al.

    Symbolic automata constraint solving

  • M. Veanes et al.

    Symbolic finite state transducers: algorithms and applications

  • L. D'Antoni et al.

    Minimization of symbolic automata

  • F. Gécseg et al.

    Tree Automata

    (1984)

  • P. Hooimeijer et al.

    An evaluation of automata algorithms for string analysis

  • L. D'Antoni et al.

    Fast: a transducer-based language for tree manipulation

  • M. Veanes et al.

    Symbolic tree transducers
