1 Introduction

Fig. 1 Automata learning framework based on training a recurrent neural network (RNN) using a given set of traces. To learn a minimal automaton, we adapt the structure of the RNN iteratively

Models are at the heart of any engineering discipline. They capture the necessary abstractions to master the complexity in a systematic design and development process. In software engineering, models are used for a variety of tasks, including specification, design, code-generation, verification, and testing. In formal methods, these models are given formal mathematical semantics to reach the highest assurance levels. This is achieved through (automated) deduction, i.e., the reasoning about specific properties of a general model.

With the advent of machine learning, there has been a growing interest in the induction of models, i.e., the learning of formal models from data. We have seen techniques to learn deterministic and non-deterministic finite state machines, Mealy machines, timed automata, and Markov decision processes. In this research, called automata learning [1], model learning [2], or model inference [3], specific algorithms have been developed that either start from given data (passive learning) [4] or actively query a system during learning (active learning) [5]. Two prominent libraries that implement such algorithms are AALpy [6] (implemented in Python) and LearnLib (implemented in Java) [7].

An alternative to specific algorithms is to map the automata learning problem to another domain. For example, it was shown that the learning problem can be encoded as a SAT [8,9,10,11] or SMT [12, 13] problem; it is then the task of the respective solver to find a model consistent with the given data.

In this work, we explore the question of whether machine learning can be harnessed for automata learning. That is, we investigate if and how the problem of automata learning can be mapped to a machine learning architecture. Our results show that a specific recurrent neural network (RNN) architecture is able to learn a Mealy machine from given data. Specifically, we approach the classic NP-complete problem of inducing an automaton with at most k states that is consistent with a finite sample of a regular language [14]. Figure 1 depicts the basic procedure of the presented RNN-based learning technique. Given a set of traces from a black-box system, we train an RNN from which we extract an automaton that models the behavior of the system.

The main contributions of this work can be summarized as follows: (i) a novel architecture for automata learning by enhancing classical RNNs, (ii) a specific constrained training approach exploiting regularization, (iii) a systematic evaluation with standard grammatical inference problems and a real-world case study, and (iv) evidence that we can find an appropriate architecture to learn the correct automata in all considered cases.

This is an extended version of a conference paper presented at SEFM 2022, the 20th International Conference on Software Engineering and Formal Methods [15]. The new contributions comprise (i) a generalized algorithm that does not require the number of states of the automaton to be known in advance, but constructs an automaton with a minimal number of states, and (ii) the corresponding new evaluation.

This new generalization of our learning technique is non-trivial. Figure 1 illustrates the iterative procedure of our extension. The main idea is as follows: First, we determine an upper bound on the number of states necessary to capture the data in a Mealy machine. For this, we build a tree out of the given data traces and count its nodes. Then, we initialize an RNN with a size proportional to this upper bound. During training, we search for an automaton representation that captures the data. Once an automaton is found, we check with a standard minimization algorithm if a smaller automaton exists. If the minimized automaton has fewer states than previously found automata, we train an RNN with this smaller target number of states. This is repeated until we find a minimal automaton that is consistent with the training data. To sum up, we start with a tree-shaped model without any generalization and end with a minimal automaton. Our evaluation demonstrates that this is indeed possible.

The rest of the paper is structured as follows. Section 2 introduces preliminary work. In Sect. 3, we present our automata learning technique based on RNNs. Section 4 discusses the results of the conducted case studies. We compare to related work in Sect. 5, followed by concluding remarks in Sect. 6.

2 Preliminaries

2.1 Recurrent neural networks

Recurrent neural networks (RNNs) are a popular choice for modeling sequential data, such as time-series data [16]. The classical version of an RNN with feedback from a hidden layer to itself is known as vanilla RNN [17].

A vanilla recurrent neural network with input x and output y is defined as

$$\begin{aligned} h^{<t>}&= f\left( W_{hx} x^{<t>} + W_{hh} h^{<t-1>} + b_h\right) \\ \hat{y}^{<t>}&= g\left( W_y h^{<t>} + b_y\right) \end{aligned}$$

where f and g are activation functions for the recurrent and the output layer, respectively. Popular activation functions for the recurrent layer are rectified linear unit (ReLU) and hyperbolic tangent (tanh), whereas the softmax or hardmax functions may be used for g when categorical output values shall be predicted. The activation functions for the output values are defined as

$$\begin{aligned} \hbox {softmax}(z)[i] = \frac{e^{z[i]}}{\sum \nolimits _{n=1}^{N} e^{z[n]}}, \end{aligned}$$

and

$$\begin{aligned} \hbox {hardmax}(z)[i]= {\left\{ \begin{array}{ll} 1, &{} \text {if}\quad z[i] = \max (z)\\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$

for \(i \in 1 \ldots N\) and \(z = (z[1],\ldots ,z[N]) \in \mathbb {R}^N\), where \(\hbox {softmax}(z)\) provides a probability distribution over the values of vector z and \(\hbox {hardmax}(z)\) assigns the probability of one to the index in z with the highest value. The parameters, also known as weights, \(\Theta = (W_{hx}, W_{hh}, b_h, W_y, b_y)\) need to be learned. The input to the network at time step t is \(x^{<t>}\), whereas \(\hat{y}^{<t>}\) is the corresponding prediction of the network. \(h^{<t>}\) is referred to as the hidden state of the network and is used to access information from past time steps or, equivalently, to pass relevant information from the current time step to future steps.
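For illustration, a single time step of such a vanilla RNN can be sketched in NumPy as follows; all names, shapes, and the choice of tanh are illustrative, not taken from our implementation.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift by max(z) for numerical stability
    return e / e.sum()

def hardmax(z):
    out = np.zeros_like(z)
    out[np.argmax(z)] = 1.0  # probability one for the index with the highest value
    return out

def rnn_step(x_t, h_prev, W_hx, W_hh, b_h, W_y, b_y):
    # h^<t> = f(W_hx x^<t> + W_hh h^<t-1> + b_h) with f = tanh
    h_t = np.tanh(W_hx @ x_t + W_hh @ h_prev + b_h)
    # y^<t> = g(W_y h^<t> + b_y) with g = softmax
    y_t = softmax(W_y @ h_t + b_y)
    return h_t, y_t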

An RNN maps an input sequence \(\textbf{x}\) to an output sequence \({\hat{\textbf{y}}}\) of the same length. It is trained based on training data \(\{(\textbf{x}_1,\textbf{y}_1), \ldots , (\textbf{x}_m,\textbf{y}_m)\}\) containing m sequence pairs. While processing input sequences \(\textbf{x}_i = (x_i^{<1>}, \ldots , x_i^{<n>})\), values of the parameters \(\Theta \) are learned to minimize the error between the true outputs \(\textbf{y}_i = (y_i^{<1>}, \ldots , y_i^{<n>})\) and the network’s predictions \((\hat{y}^{<1>}_i, \ldots , \hat{y}^{<n>}_i)\).

The error is measured through a predefined loss function. The most popular loss functions are the mean squared error for real-valued \(y^{<t>}\), and the cross-entropy loss for categorical \(y^{<t>}\).

Gradient-based methods are used to minimize the error by iteratively changing each weight in proportion to the derivative of the error with respect to that weight, until the error falls below a predefined threshold or a fixed number of iterations has been performed.
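For categorical outputs, one such gradient step might look as follows in PyTorch; this is a minimal sketch, and the function names and optimizer choice are our own illustration.

import torch
import torch.nn.functional as F

def gradient_step(model, optimizer, x, y_true):
    loss = F.cross_entropy(model(x), y_true)  # cross-entropy for categorical outputs
    optimizer.zero_grad()
    loss.backward()   # derivative of the error w.r.t. each weight
    optimizer.step()  # adjust each weight in proportion to its gradient
    return loss.item()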

2.2 Finite state machines

We consider finite-state machines (FSMs) in the form of Mealy machines:

Definition 1

A Mealy machine is a 6-tuple \(\langle Q, q_0, I,O,\delta ,\lambda \rangle \) where

  • Q is a finite set of states containing the initial state \(q_0\),

  • I and O are finite sets of input and output symbols,

  • \(\delta : Q \times I \rightarrow Q\) is the state transition function, and

  • \(\lambda : Q \times I \rightarrow O\) is the output function.

Starting from a fixed initial state, a Mealy machine \(\mathcal {M}\) responds to inputs \(i \in I\) by changing its state according to \(\delta \) and producing outputs \(o \in O\) according to \(\lambda \). Given a sequence of inputs \(\textbf{i} \in I^*\), \(\mathcal {M}\) produces an output sequence \(\textbf{o} = \lambda ^*(q_0, \textbf{i})\), where \(\lambda ^*(q, \epsilon ) = \epsilon \) for the empty sequence \(\epsilon \) and \(\lambda ^*(q, i \cdot \textbf{i}) = \lambda (q,i) \cdot \lambda ^*(\delta (q,i), \textbf{i})\), where i is an input, \(\textbf{i}\) is an input sequence, and \(\cdot \) denotes concatenation. Given input and output sequences \(\textbf{i}\) and \(\textbf{o}\) of the same length, we use \(t(\textbf{i},\textbf{o})\) to create a sequence of input–output pairs in \((I \times O)^*\). We call such a sequence of pairs a trace.
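A Mealy machine and the lifted output function \(\lambda^*\) can be sketched directly from Definition 1; the following minimal Python sketch uses a dictionary-based representation, which is our own choice.

class MealyMachine:
    def __init__(self, q0, delta, lam):
        self.q0 = q0        # initial state
        self.delta = delta  # transition function: dict (state, input) -> state
        self.lam = lam      # output function: dict (state, input) -> output

    def run(self, inputs):
        """Compute the output sequence o = lambda*(q0, i) for an input sequence i."""
        q, outputs = self.q0, []
        for i in inputs:
            outputs.append(self.lam[(q, i)])  # emit lambda(q, i)
            q = self.delta[(q, i)]            # move to delta(q, i)
        return outputs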

A Mealy machine \(\mathcal {M}\) defines a regular language over \(I \times O\): \(L(\mathcal {M}) = \{t(\textbf{i}, \textbf{o}) \mid \textbf{i} \in I^*, \textbf{o} = \lambda ^*(q_0, \textbf{i})\} \subseteq (I \times O)^*\). The language contains the deterministic response to any input sequence and excludes all other sequences. We can now formalize the problem that we tackle in this paper: Given a finite set of traces \(S \subset (I \times O)^*\), we learn a Mealy machine \(\mathcal {M}\) with at most n states such that \(S \subseteq L(\mathcal {M})\), by training an RNN. This is a classic NP-complete problem in grammatical inference [14]. Usually, it is stated for deterministic finite automata (DFAs), but any DFA can be represented by a Mealy machine with \( true \) and \( false \) as outputs, denoting whether a word (input sequence) is accepted.

Fig. 2 Mealy machine of a ping server

Example 1

(Model of Ping Server) Figure 2 shows a Mealy machine of a simple ping server that responds to pings after a connection has been established. The model has three states that are connected with transitions labeled by pairs of inputs and outputs. For example, from the initial state \(q_0\), the server responds with \( ConnAck \) to the input \( Connect \). Any further \( Connect \) input leads to a closing of the connection with the corresponding output observation \( ConnectionClosed \).

Next, we introduce auxiliary techniques that are related to automata learning. With the first, we compute a bound on the number of FSM states that are sufficient for a Mealy machine to produce a set of given traces. The second technique, FSM minimization, computes from a Mealy machine \(\mathcal {M}\) a Mealy machine \(\mathcal {M}'\) such that \(\mathcal {M}'\) has the minimal number of states and its language is equivalent to that of \(\mathcal {M}\).

2.2.1 Bounding FSM size

Let \(S \subset (I \times O)^*\) be a finite set of traces and let \(\ll \) be the reflexive prefix relation on traces. To compute an upper bound on the number of FSM states sufficient to produce S, we create a prefix-tree acceptor (PTA) from S and use its number of nodes as the bound. PTA creation is a common preprocessing step in automata learning algorithms [18], like RPNI [4]. We create input–output prefix tree acceptors (IOPTAs), a variation of PTAs similar to those used by IoAlergia [19]. An IOPTA T is a tree that compactly represents a trace sample S, with edges labeled by inputs and nodes, except the root, labeled by outputs. Hence, a path from the root to a node of T is labeled by a trace in \((I \times O)^*\).

An IOPTA T created for a trace sample S contains a path labeled by a trace t iff S contains a trace \(t'\) with \(t\ll t'\), i.e., T contains a path for every trace prefix. Thus, T can be created from S by merging traces with common prefixes. Such an IOPTA T is a partial Mealy machine that defines exactly the prefix-closure of S. We can deduce that the number of nodes of T is a bound on the number of FSM states sufficient to represent S. Since languages defined by a Mealy machine are prefix-closed, IOPTA computation does not introduce any generalization.
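The IOPTA construction and the resulting state bound can be sketched as follows, assuming traces are given as sequences of input–output pairs; the class and function names are ours.

class IOPTANode:
    def __init__(self, output=None):
        self.output = output  # output labeling the node (None for the root)
        self.children = {}    # input symbol -> child IOPTANode

def build_iopta(traces):
    """Merge traces (sequences of (input, output) pairs) by common prefixes."""
    root = IOPTANode()
    for trace in traces:
        node = root
        for i, o in trace:
            if i not in node.children:
                node.children[i] = IOPTANode(o)
            node = node.children[i]
    return root

def state_bound(node):
    """Number of nodes of the IOPTA = bound on the number of FSM states."""
    return 1 + sum(state_bound(c) for c in node.children.values())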

Fig. 3 IOPTA representing traces sampled from the ping server

Example 2

(IOPTA of ping server) Suppose we sampled the following set of three traces:

  • \( Connect / ConnAck \cdot Connect / ConnectionClosed \cdot Connect / ConnAck \)

  • \( Connect / ConnAck \cdot Ping / Pong \cdot Ping / Pong \)

  • \( Ping / ConnectionClosed \cdot Connect / ConnAck \cdot Ping / Pong \)

The corresponding IOPTA is shown in Fig. 3, where outputs are put in curly braces to distinguish them from inputs. The IOPTA has nine nodes; thus, we know that nine states are sufficient to model the ping server.

2.2.2 FSM minimization

Minimization of FSMs basically partitions the states Q of a Mealy machine \(\mathcal {M}\) into blocks B that are equivalent w.r.t. \(\lambda ^*\); see Hopcroft et al. [20] for the minimization of DFAs. That is, two states are grouped into a block if they produce the same outputs for all input sequences, and thus cannot be distinguished. Given such a partition B, a minimal Mealy machine \(\mathcal {M}'\) can be constructed with states B, i.e., states given by blocks of indistinguishable states. A transition between \(b \in B\) and \(b' \in B\) is created if there is a corresponding transition between \(r\in b\) and \(s \in b'\) in \(\mathcal {M}\). Note that \(\mathcal {M}'\) is unique up to a renaming of states. Active automata learning algorithms, like \(L^*\), have minimality as an inherent property, whereas we apply minimization as an additional step. Efficient algorithms, such as Hopcroft's FSM minimization algorithm [21], have an \(n \log n \) worst-case runtime complexity. Hence, the runtime overhead of the minimization step is negligible.
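The partitioning into blocks of indistinguishable states can be sketched as follows; for simplicity, this is a naive quadratic partition refinement rather than Hopcroft's \(n \log n\) algorithm, and all names are ours.

def minimize_blocks(states, inputs, delta, lam):
    """Return the coarsest partition of states into blocks that agree on lambda*."""
    def signature(q, block_of):
        # outputs and successor blocks of q for every input
        return tuple((lam[(q, i)], block_of[delta[(q, i)]]) for i in inputs)

    partition = [set(states)]  # start with a single block
    while True:
        block_of = {q: idx for idx, block in enumerate(partition) for q in block}
        refined = []
        for block in partition:
            groups = {}
            for q in block:  # split states with differing signatures
                groups.setdefault(signature(q, block_of), set()).add(q)
            refined.extend(groups.values())
        if len(refined) == len(partition):
            return partition  # stable: blocks of indistinguishable states
        partition = refined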

Fig. 4 Minimal Mealy machine of a ping server

Example 3

(Minimization of Ping Model) The model shown in Fig. 2 is non-minimal. The states \(q_1\) and \(q_2\) are equivalent, as there is no sequence that distinguishes them. Hence, a minimization would create partition \(\{\{q_0\},\{q_1,q_2\}\}\). Based on that, we can create the minimal Mealy machine shown in Fig. 4.

2.3 Automata learning

Automata learning creates behavioral FSMs of black-box systems. Figure 5 illustrates the general framework for learning a reactive system model in the form of a Mealy machine. The goal of automata learning is to create a model \(\mathcal {M}\) such that \(L(\mathcal {M}) = L(\mathcal {M}_\textrm{SUL})\), where \(\mathcal {M}_\textrm{SUL}\) is an unknown Mealy machine representing the System Under Learning (SUL).

Fig. 5 The automata learning framework generates a Mealy machine from a sample of traces. The sample is generated from the executions of inputs on the reactive system

We distinguish between active and passive learning algorithms. Passive learning creates a behavioral model from a given set of traces: it infers from a finite set of traces \(S \subset (I \times O)^*\) a Mealy machine \(\mathcal {M}_P\) such that \(S \subseteq L(\mathcal {M}_P)\), often restricting \(\mathcal {M}_P\) to have at most k states. Given that \(S \subseteq L(\mathcal {M}_\textrm{SUL})\), most algorithms guarantee \(L(\mathcal {M}_P) = L(\mathcal {M}_\textrm{SUL})\) for large enough S and finite \(\mathcal {M}_\textrm{SUL}\) [18]. One challenge in the application of passive learning is to provide a finite set of traces such that \(L(\mathcal {M}_P) = L(\mathcal {M}_\textrm{SUL})\).

Active automata learning queries the SUL to create a behavioral model. Many active learning algorithms are based on the \(L^*\) algorithm [5], which is defined for different modeling formalisms like Mealy machines [22]. \(L^*\) queries the SUL to generate a finite set of traces \(S \subset (I \times O)^*\) from which a hypothesis Mealy machine \(\mathcal {M}_A\) is constructed that fulfills \(S \subseteq L(\mathcal {M}_A)\). \(L^*\) guarantees that \(\mathcal {M}_A\) is minimal. The hypothesis \(\mathcal {M}_A\) is then checked for equivalence to the language \(L(\mathcal {M}_\textrm{SUL})\). Since \(\mathcal {M}_\textrm{SUL}\) is unknown, checking the behavioral equivalence between \(\mathcal {M}_\textrm{SUL}\) and \(\mathcal {M}_A\) is generally undecidable. Hence, conformance testing is used to substitute the equivalence oracle in active learning. Model-based testing techniques generate a finite set of traces \(S_\mathcal {T} \subset (I \times O)^*\) from executions on \(\mathcal {M}_A\) and check if \(S_\mathcal {T} \subset L(\mathcal {M}_\textrm{SUL})\). If \(t(\textbf{i}, \textbf{o}) \notin L(\mathcal {M}_\textrm{SUL})\), a counterexample to the behavioral equivalence between \(\mathcal {M}_\textrm{SUL}\) and \(\mathcal {M}_A\) is found. Based on this trace, the set of traces S is extended by performing further queries. Again, a hypothesis \(\mathcal {M}_\textrm{A}\) is created and checked for equivalence. This procedure repeats until no counterexample to the equivalence between \(L(\mathcal {M}_\textrm{SUL})\) and \(L(\mathcal {M}_\textrm{A})\) can be found. The algorithm then returns the learned automaton \(\mathcal {M}_\textrm{A}\). Note that \(L^*\) creates \(\mathcal {M}_\textrm{A}\) such that \(S \subset L(\mathcal {M}_\textrm{A})\). With access to a perfect behavioral equivalence check, which provides any differences between the languages defined by \(\mathcal {M}_\textrm{SUL}\) and \(\mathcal {M}_A\), we could guarantee that the generated finite set of traces S enables learning a model \(\mathcal {M}_A\) such that \(L(\mathcal {M}_A) = L(\mathcal {M}_\textrm{SUL})\).

3 Automata learning with RNNs

In this section, we first present the problem that we tackle and propose an RNN architecture as a solution. After that, we cover (i) the constrained training of the proposed RNN architecture with our specific regularization term, and (ii) the usage of the trained RNN to extract an appropriate automaton. Finally, we propose an iterative learning algorithm that uses the proposed RNN architecture and automaton extraction to learn a minimal automaton without knowing the minimal number of states.

3.1 Overview and architecture

It is well known that recurrent neural networks (RNNs) can be used to efficiently model time-series data, such as data generated from the interaction with a Mealy machine. Concretely, this can be done by using the machine inputs \(x^{<t>}\) as inputs to the RNN and minimizing the difference between the machine’s true outputs \(y^{<t>}\) and the RNN’s predictions \(\hat{y}^{<t>}\). In other words, the RNN would predict the language \(L(\mathcal {M})\) of a Mealy machine \(\mathcal {M}\).

Fig. 6 RNN-cell architecture

This optimization process can be performed via gradient descent. Even if such a trained RNN can model all interactions with perfect accuracy, one disadvantage compared to a native automaton representation, e.g., a Mealy machine, is that it is much less interpretable. While each state in a Mealy machine can be identified by a discrete number, the hidden state of the RNN, which is the information passed from one time step to the next, is a continuous real-valued vector. This vector may be needlessly large and contain mostly redundant information. Thus, it would be useful if we could simplify such a trained RNN into a Mealy machine \(\mathcal {M}_{R}\) that produces the language of the Mealy machine \(\mathcal {M}\) we want to learn, i.e., with \(L(\mathcal {M}) = L(\mathcal {M}_R)\).

We approach the following problem. Given a sample \(S \subset L(\mathcal {M})\) of traces \(t(\textbf{i}_j,\textbf{o}_j)\) and the number of states k of \(\mathcal {M}\), we train an RNN to correctly predict \(\textbf{o}_j\) from \(\textbf{i}_j\). To facilitate interpretation, we want to extract a Mealy machine \(\mathcal {M}_R\) from the trained RNN with at most k states, modeling the same language. For \(\mathcal {M}_R\), \(S \subset L(\mathcal {M}_R)\) shall hold such that for large enough S we have \(L(\mathcal {M}) = L(\mathcal {M}_R)\).

For this purpose, we propose an RNN architecture and learning procedure that ensure that the RNN hidden states can be cleanly translated into k discrete automata states. Compared to standard vanilla RNNs, the hidden states are transformed into an estimate of a categorical distribution over the k possible automaton states. This restricts the encoding of information in the hidden states since now all components need to be in the range [0, 1] and sum up to 1. Figure 6 shows our complete RNN cell architecture for a single hidden layer, implementing the following equations.

$$\begin{aligned} h^{<t>}&= af\left( W_{hx} x^{<t>} + W_{hs} s^{<t-1>} + b_h\right) , \quad af \in \{\hbox {ReLU}, \hbox {tanh}\} \quad (1)\\ \hat{y}^{<t>}&= \hbox {softmax}\left( W_y h^{<t>} + b_y\right) \quad (2)\\ \hat{s}^{<t>}&= \hbox {softmax}\left( W_s h^{<t>} + b_s\right) \quad (3)\\ s^{<t>}&= {\left\{ \begin{array}{ll} \hbox {softmax}\left( W_s h^{<t>} + b_s\right) &{} \text {if mode} = ``{}\texttt {train}\hbox {''}\\ \hbox {hardmax}\left( W_s h^{<t>} + b_s\right) &{} \text {if mode} = ``{}\texttt {infer}\hbox {''} \end{array}\right. } \quad (4) \end{aligned}$$

In comparison with vanilla RNN cells, the complete hidden state \(h^{<t>}\) is only an intermediate vector of values. Based on \(h^{<t>}\), an output \(\hat{y}^{<t>}\) is predicted using a softmax activation. A Mealy machine state \(\hat{s}^{<t>}\) is predicted as well and passed to the next time step. It is computed via (i) softmax during RNN training, and (ii) hardmax during inference. During training, see Algorithm 2, we also compute the cross-entropy of \(\hat{s}^{<t>}\) with \(hardmax(\hat{s}^{<t>})\) as a label, which serves as a regularization term. Inference refers to extracting an automaton from the trained RNN, which takes as input the current system state and an input symbol and gives as output the next system state and an output symbol. Hence, we use softmax to estimate a categorical distribution over possible states during training, whereas we use hardmax to concretely infer one state during inference.
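Equations 1–4 can be sketched as a PyTorch module as follows; the layer and parameter names are ours, and ReLU is fixed as the activation function for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AutomatonRNNCell(nn.Module):
    """Sketch of the cell in Fig. 6, implementing Eqs. 1-4 for one time step."""

    def __init__(self, n_inputs, n_outputs, n_states, n_hidden):
        super().__init__()
        self.ih = nn.Linear(n_inputs + n_states, n_hidden)  # W_hx, W_hs, b_h
        self.out = nn.Linear(n_hidden, n_outputs)           # W_y, b_y
        self.state = nn.Linear(n_hidden, n_states)          # W_s, b_s

    def forward(self, x_t, s_prev, mode="train"):
        h_t = torch.relu(self.ih(torch.cat([x_t, s_prev], dim=-1)))  # Eq. 1
        y_logits = self.out(h_t)          # Eq. 2 (softmax applied in the loss)
        s_soft = F.softmax(self.state(h_t), dim=-1)                  # Eq. 3
        if mode == "train":
            s_t = s_soft                  # Eq. 4: softmax during training
        else:                             # "infer": hardmax as a one-hot vector
            s_t = F.one_hot(s_soft.argmax(dim=-1), s_soft.size(-1)).float()
        return y_logits, s_soft, s_t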

Our algorithm for extracting a Mealy machine from a trained RNN, see Algorithm 3, is based on the idea that if the RNN achieves perfect accuracy when predicting the machine’s true outputs, the hidden state \(h^{<t>}\) encodes information corresponding to the state of a Mealy machine at time step t. Otherwise, the RNN would not be able to predict the expected outputs correctly, since those are a function of both the input and the current state. By adapting the RNN architecture, we enforce hidden states to correspond to discrete Mealy machine states.

3.1.1 Multiple hidden layers

The matrices \(W_{hx}\), \(W_{hs}\), \(W_y\), and \(W_s\) and the corresponding bias vectors introduced before define the weights of an RNN with a single layer. It may be beneficial to add additional hidden layers to better predict certain complex behaviors. For the hidden layers, we use fully connected layers, where each layer i is defined by a pair \({W_{hh}}_i\) and \({b_{h}}_i\) containing the weights of that layer. Each layer performs an additional transformation of the hidden state \(h^{<t>}\).

Concretely, the one-hot-encoded input \(x^{<t>}\) and state \(s^{<t-1>}\) are first mapped to \(h^{<t>}\), that is, Eq. 1 is left unchanged. Let \(h^{<t>}_0 = h^{<t>}\), then every additional layer performs the transformation \(h^{<t>}_i = af({W_{hh}}_i h^{<t>}_{i-1} + {b_{h}}_i)\). When we have multiple layers, we perform the state and output prediction on the result \(h^{<t>}_k\) from the last hidden layer k, that is, we substitute \(h^{<t>}\) by \(h^{<t>}_k\) in Eqs. 2, 3 and 4. Hence, input processing is carried out only by the first layer, and state and output predictions, along with their corresponding regularization, are performed exclusively in the last layer. All transformations in between are not affected by regularization.
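A minimal sketch of such a stack of additional fully connected layers, again with ReLU and with names of our own choosing:

import torch.nn as nn

def hidden_stack(n_hidden, n_layers):
    """Extra fully connected layers h_i = af(W_hh_i h_{i-1} + b_h_i), af = ReLU."""
    layers = []
    for _ in range(n_layers):
        layers += [nn.Linear(n_hidden, n_hidden), nn.ReLU()]
    return nn.Sequential(*layers)  # applied to h^<t> before Eqs. 2-4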

3.2 Training and automaton extraction

In the following, we first discuss how to train an RNN with the structure shown in Fig. 6 such that it will encode an automaton. Secondly, we show how to extract the automaton from a trained RNN. We start by illustrating the basic operation of such an RNN, i.e., the prediction of outputs and state transitions from an input sequence. This is called the forward pass and is used during training and automaton extraction.

Forward pass. Algorithm 1 implements the forward pass, taking an input sequence \(\textbf{x}\) and a mode variable as parameters. The mode variable distinguishes between training (train) and automaton extraction (infer). The algorithm returns a pair \((\hat{\textbf{y}}, \hat{\textbf{s}})\) comprising the predicted output sequence and the sequence of hidden states visited by the forward pass. We want to learn the language of a Mealy machine, i.e., map \(\textbf{i} \in I^*\) to \(\textbf{o} \in O^*\) for sets IO of input and output symbols. Therefore, we encode every \(i \in I\) using a one-hot encoding to yield input sequences \(\textbf{x}\) from \(\textbf{i} \in I^*\). In this encoding, every i is associated with a unique \(\vert I\vert \)-dimensional vector, where exactly one element is equal to one and all others are zero. We write x for a one-hot-encoded input i. Analogously, we encode outputs, and the hidden state shall approach a one-hot encoding in a k-dimensional vector space. For one-hot-encoded outputs, we generally use the letter y, and we use D to denote one-hot-encoded training datasets derived from a sample \(S \subset L(\mathcal {M})\).

Algorithm 1 Model forward pass M(x, mode)

Algorithm 1 initializes the output and state sequences \(\hat{\textbf{y}}\) and \(\hat{\textbf{s}}\) to the empty sequences and the hidden state s of the RNN to the one-hot encoding of the fixed initial state \(q_0\). For every input symbol \(x^{<t>}\), Lines 4–12 perform the equations defining the RNN, i.e., applying affine transformation using weights and an activation function. At each step t, we compute and store the predicted output \(\hat{y}^{<t>}\) (Line 6) and the predicted state \(\hat{s}^{<t>}\) (Line 7) in \(\hat{\textbf{y}}\) and \(\hat{\textbf{s}}\), respectively. In the “train” mode, we pass \(\hat{s}^{<t>}\) as hidden state to the next time step (Line 9). In the “infer” mode used for automaton extraction, we apply a hardmax on the hidden state (Line 11) so that exactly one state is predicted.
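Assuming the AutomatonRNNCell sketched in Sect. 3.1, the forward pass of Algorithm 1 might be implemented as follows; this is a sketch in which sequences are represented as lists of one-hot tensors.

import torch
import torch.nn.functional as F

def forward_pass(cell, x_seq, mode="train"):
    """Run the cell over a list of one-hot input tensors (cf. Algorithm 1)."""
    n_states = cell.state.out_features
    s = F.one_hot(torch.tensor(0), n_states).float()  # fixed initial state q0
    y_hat, s_hat = [], []
    for x_t in x_seq:
        # one cell step; the state s is passed on to the next time step
        y_logits, _, s = cell(x_t, s, mode)
        y_hat.append(y_logits)  # predicted output (Line 6)
        s_hat.append(s)         # predicted state (Line 7)
    return y_hat, s_hat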

Training. The architecture is trained by minimizing a prediction loss between \(y^{<t>}\) and \(\hat{y}^{<t>}\) along with a regularization loss: the cross-entropy of the state distribution \(s^{<t>}\) w.r.t. the state with the highest probability in \(s^{<t>}\). On the one hand, our regularization design encourages the RNN to re-use a state at subsequent steps once it has been selected at the current step, since this contributes to decreasing the regularization loss. On the other hand, it encourages using as few states as possible overall, since any additional used state contributes to increasing the regularization loss. Minimizing our regularization of choice forces the RNN to increase the certainty about the predicted state. This ensures that the hidden states tend to be approximately one-hot-encoded vectors, where the index of the maximal component corresponds to the state of a Mealy machine accepting the same language. Note that directly using a discrete state representation is not beneficial when training with gradient descent. Algorithm 2 implements the training in PyTorch-like [23] style. Its parameters are the training dataset D, a sample of the language to be learned, the number of learning epochs, and a regularization factor, which controls the influence of state regularization. The training is performed using the gradient-descent-based Adam optimizer [24]. The algorithm performs up to \(\#epochs\) loops over the training data. An epoch processes each trace in the training data, referred to as an episode (Lines 4–21). For the actual training, we perform a forward pass in “train” mode and compute the overall loss from the prediction and state regularization losses (Lines 5–9). Lines 10 and 11 update the RNN parameters, i.e., the weights.

Training stops when the prediction accuracy of the RNN operated as an automaton reaches 100% or \(\#epochs\) epochs have been performed. To calculate the accuracy, we perform a forward pass in the “infer” mode in Line 15 and compute the average accuracy in Line 16. Upon finishing the training, Algorithm 2 returns a Boolean variable indicating if the prediction accuracy converged to \(100\%\) and the trained RNN.
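A corresponding training loop in the spirit of Algorithm 2 might look as follows; this sketch builds on the forward_pass above, uses illustrative hyperparameter values, omits batching, and returns the convergence flag together with the model as in Algorithm 2.

import torch
import torch.nn.functional as F

def accuracy(cell, dataset):
    """Fraction of outputs predicted correctly with the RNN run in 'infer' mode."""
    correct = total = 0
    for x_seq, y_seq in dataset:
        y_hat, _ = forward_pass(cell, x_seq, mode="infer")
        for y_logits, y_t in zip(y_hat, y_seq):
            correct += int(y_logits.argmax().item() == y_t.argmax().item())
            total += 1
    return correct / total

def train(cell, dataset, epochs=100, reg_factor=0.001, lr=0.001):
    """Sketch of Algorithm 2; dataset holds pairs of one-hot sequences (x, y)."""
    opt = torch.optim.Adam(cell.parameters(), lr=lr)
    for _ in range(epochs):
        for x_seq, y_seq in dataset:  # one episode per trace
            y_hat, s_hat = forward_pass(cell, x_seq, mode="train")
            loss = torch.tensor(0.0)
            for y_logits, y_t, s_t in zip(y_hat, y_seq, s_hat):
                # prediction loss between the true output and the prediction
                loss = loss + F.cross_entropy(y_logits.unsqueeze(0),
                                              y_t.argmax().unsqueeze(0))
                # regularization: cross-entropy of s^<t> with hardmax(s^<t>) as label
                loss = loss + reg_factor * F.nll_loss(
                    s_t.log().unsqueeze(0), s_t.argmax().unsqueeze(0))
            opt.zero_grad()
            loss.backward()
            opt.step()
        if accuracy(cell, dataset) == 1.0:  # stop once the automaton is exact
            return True, cell
    return False, cell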

Algorithm 2 RNN training train(M, D)

The purpose of the trained RNN model is not to predict outputs of new inputs unseen during training, but to help with inferring an automaton that produces the training data. This automaton shall be used to predict the outputs corresponding to (new) inputs. Thus, we use all available data for training the RNN and aim at achieving perfect accuracy on the training data. Perfect accuracy on the training set gives us confidence that the internal state representation of the learned RNN model corresponds to the true (partial) automaton that produced the data. In cases where the training data does not cover all states and transitions of the full true automaton, we might learn a partial automaton missing some states and transitions. Using all available data for training reduces the risk of learning just a partial automaton.

Automaton extraction from a trained RNN. Given a trained RNN model, we extract the corresponding automaton with Algorithm 3. We represent a Mealy machine by its set of transitions in the following form:

$$\begin{aligned} T&= \{(s, s', i/o) \mid s, s' \in Q \wedge i \in I \wedge o \in O \wedge \\&\quad \delta (s, i) = s' \wedge \lambda (s, i) = o \}. \end{aligned}$$
Algorithm 3 Automaton extraction from RNN extract(M, D)

Algorithm 3 starts by initializing T to the empty set. Then, it iterates through all episodes, i.e., all traces, from the training set D. At each iteration (Lines 3–13), it first runs the RNN model M on the one-hot-encoded input sequence \(\textbf{x}\) of the current episode (Line 3) to obtain the corresponding predicted output symbols and transition state sequence \(\hat{\textbf{y}}\) and \(\hat{\textbf{s}}\), respectively. For this purpose, we perform the forward pass implemented by Algorithm 1 in the “infer” mode. This mode uses the \( hardmax \) operation to compute the encoded state in each step to ensure stability of the extraction process. A well-trained RNN with our architecture will traverse states that are close to being one-hot-encoded. The states are generally not perfectly one-hot-encoded due to numerical imprecision and the nature of RNN training. Using \( hardmax \) to compute the state suppresses the accumulation of such small imprecisions. This is especially relevant when processing long training episodes during automaton extraction. Using the \( softmax \) operation, we might get ambiguous extraction results, which manifest as non-deterministic transitions.

Lines 6–13 iterate through all steps of the current episode. All episodes start from the initial state \(q_0\) which, by construction, is assigned the label 0. Thus, we initialize the first state to 0 (Line 4). If the predicted output symbol matches the label at the current step (Line 8), then T is extended by a triple encoding a transition, which is built from the source/target states and the input/output symbols of the current step. By applying argmax to the one-hot-encoded input \(x^{<t>}\) and output \(y^{<t>}\), we get integer-valued discrete representations of them (Line 7). The actual input and output symbols are obtained from the input and output alphabets, respectively, through an appropriate indexed mapping. For simplicity, we do not show this mapping here. If the predicted output does not match the expected value, the current and remaining steps of the current episode are ignored and the algorithm moves to the next episode (Line 2). An episode consists of a sequence of adjacent steps (or transitions) in the automaton, i.e., the next step starts from the state where the current step ended (Line 13). After processing all training data traces, Algorithm 3 returns the extracted automaton with transitions T. Note that the extracted automaton might not include all states that can be one-hot encoded with a vector of the length of \(s^{<t>}\). Hence, the number of states of the extracted automaton might be smaller than the length of \(s^{<t>}\).
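Building on the same helpers, the extraction of Algorithm 3 might be sketched as follows; transitions are stored with integer-valued symbols, and the mapping back to the actual alphabets is omitted as above.

def extract(cell, dataset):
    """Sketch of Algorithm 3: collect transitions (s, s', i/o) in 'infer' mode."""
    transitions = set()
    for x_seq, y_seq in dataset:
        y_hat, s_hat = forward_pass(cell, x_seq, mode="infer")
        src = 0  # all episodes start in the initial state, labeled 0
        for x_t, y_t, y_logits, s_t in zip(x_seq, y_seq, y_hat, s_hat):
            if y_logits.argmax().item() != y_t.argmax().item():
                break  # misprediction: skip the rest of this episode
            dst = s_t.argmax().item()  # hardmax state as an integer label
            transitions.add((src, dst,
                             (x_t.argmax().item(), y_t.argmax().item())))
            src = dst  # the next step starts where this one ended
    return transitions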

3.3 Minimal automaton learning

In the following, we explain how the previously proposed algorithms can be used to learn a minimal automaton. In this case, we assume that we do not know in advance the exact number of states of a minimal automaton. Using an iterative approach, we successively lower the upper bound on the number of states allowed when learning an automaton that correctly models the behavior in the given dataset. This is repeated until the number of states cannot be reduced any further.

Algorithm 4 describes our iterative learning approach. For learning, we require a dataset D, which is a sample of the language to be learned, and a specific learning strategy. We distinguish between two learning strategies: best effort (“bestEffort”) and exhaustive (“exhaustive”). In the best-effort strategy, a learning iteration (Lines 5–22) ends upon the first successful attempt, if any, to learn an automaton which perfectly explains the training data (Line 19). In contrast, the exhaustive strategy performs all \(\#runs\) attempts to learn an automaton consistent with the dataset and selects among them the minimal one, if any, for further processing. Consequently, the exhaustive strategy tries, within the given budget of \(\#runs\), to find an even smaller automaton under the known upper bound of the current learning iteration, even if an automaton with the current upper bound has already been found. That is, the exhaustive strategy always executes all \(\#runs\) iterations of the for-loop (Line 7), whereas the best-effort strategy executes at most \(\#runs\). Algorithm 4 returns a triplet consisting of (i) the best learned automaton, (ii) a Boolean variable confirming that, at the last learning iteration, an automaton of the same size has been learned as in the previous iteration, meaning that no smaller automaton could be learned, and (iii) the number of performed learning iterations. The best learned automaton is an automaton with the fewest states learned over all iterations. The Boolean variable indicates whether the RNN model learned the same smallest automaton after setting the upper bound to the minimal number of states found so far.

Algorithm 4 Minimal automaton learning fixpoint(D, strategy)

Algorithm 4 starts by creating an IOPTA from the given dataset D, as described in Sect. 2.2, and initializing the number of learning iterations \(\#it\) to 0. The generated tree represents the initial automaton and provides the first upper bound on the number of states, based on the number of nodes in the tree. We then start our iterative learning procedure. A learning iteration represents one iteration of the repeat-loop and includes the block from Lines 5–22. We terminate the learning procedure if the learned automaton cannot be further minimized.

Let \(states_n(A)\) be a function that returns the number of states of a given FSM A. In Line 5, we save in \(states_{\textrm{min}}\) the number of states of the currently best learned automaton \(A_{\textrm{min}}\), i.e., the automaton with the fewest states learned so far. In the next step, we initialize a Boolean variable \(approved_{\textrm{min}}\) to false, represented by \(\bot \). This variable indicates whether the RNN model learns again an automaton with a number of states equal to \(states_{\textrm{min}}\) after setting the upper bound of states to \(states_{\textrm{min}}\) (Line 8) at the current iteration. In the following (Lines 7–22), we attempt to further reduce the number of states of the automaton \(A_{\textrm{min}}\) with our RNN-based learning technique.

The attempt to minimize the automaton is limited to a maximum number of runs. In Line 8, we initialize our RNN model M based on the current upper bound on the number of states. In terms of the RNN cell architecture we introduced in Sect. 3.1, the given number of states defines the size of the state vector \(s^{<t>}\). We then train the RNN model M on the provided dataset as described in Algorithm 2 (Line 9). Hence, we obtain the trained RNN model M and a Boolean variable indicating whether the trained model achieved \(100\%\) accuracy in predicting the outputs of the provided dataset D. If the RNN model converges to \(100\%\) accuracy, in Lines 11–20 we (i) extract the automaton, (ii) check the automaton size, and (iii) stop trying to learn further automata with potentially fewer states if the best-effort strategy was selected.

(i) First, we extract the automaton A as described in Algorithm 3. Since we do not know the exact number of states, we may learn an automaton that has more states than the minimal automaton representing the dataset D. The extracted automaton might contain states that cannot be distinguished. As described in Sect. 2.2, we can further minimize an automaton by grouping indistinguishable states. For this, we merge all indistinguishable states of the extracted automaton A and create an equivalent automaton containing only distinguishable states in Line 12. Note that any kind of minimization on A does not affect the behavior learned by the RNN model.

(ii) We then compare the size of the minimized automaton A and the size of the current minimal automaton \(A_{\textrm{min}}\) (Line 13). If the newly extracted automaton A has fewer states than the previously learned minimal automaton \(A_{\textrm{min}}\), A becomes the new minimal automaton. In the case that we cannot further reduce the automaton size, we set \(approved_{\textrm{min}}\) to true, represented by \(\top \), indicating that the fixpoint has been reached (Line 16).

Table 1 Description of Tomita grammars

(iii) If we use the best-effort strategy, we do not execute the remaining runs of the current learning iteration after learning the first automaton perfectly explaining the training data (Line 19). In the exhaustive strategy, on the other hand, we continue the current learning iteration until the entire budget \(\#runs\) is consumed and train new RNN models with the minimum number of states from the previous learning iteration as upper bound for the size of the state vector \(s^{<t>}\). After the maximum number of runs has been performed, \(A_{\textrm{min}}\) contains the best among all automata, if any, learned during the \(\#runs\) attempts to learn an automaton perfectly explaining the training data.

Finally, Algorithm 4 terminates if, at the current learning iteration, either (i) the fixpoint is reached, i.e., an automaton with the same number of states as in the previous learning iteration is learned, or (ii) no automaton perfectly explaining the training data can be learned at all due to insufficient budget. The algorithm returns the best learned automaton, if any, otherwise IOPTA(D), along with the information whether the fixpoint has been reached and the number of performed learning iterations.
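Putting the pieces together, the fixpoint iteration of Algorithm 4 might be sketched as follows; this builds on the helpers sketched earlier, minimize_states and num_states stand in for the FSM minimization of Sect. 2.2.2, and the sample is passed both in symbolic form (traces) and one-hot form (dataset) to elide the encoding conversion.

def learn_minimal(traces, dataset, n_inputs, n_outputs, n_hidden=256,
                  runs=10, strategy="bestEffort"):
    """Sketch of Algorithm 4 (minimal automaton learning by fixpoint iteration)."""
    a_min, states_min = None, state_bound(build_iopta(traces))  # IOPTA bound
    iterations = 0
    while True:
        iterations += 1
        learned_this_iteration, approved = False, False
        for _ in range(runs):
            cell = AutomatonRNNCell(n_inputs, n_outputs, states_min, n_hidden)
            converged, cell = train(cell, dataset)
            if not converged:       # no automaton found in this run
                continue
            learned_this_iteration = True
            a = minimize_states(extract(cell, dataset))  # merge equivalent states
            if a_min is None or num_states(a) < num_states(a_min):
                a_min = a           # strictly smaller automaton found
            else:
                approved = True     # same size again: fixpoint candidate
            if strategy == "bestEffort":
                break               # first success ends this learning iteration
        if approved or not learned_this_iteration:
            return a_min, approved, iterations  # fixpoint or budget exhausted
        states_min = num_states(a_min)          # lower the bound and iterate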

4 Case studies

4.1 Case study subjects

4.1.1 Tomita grammars

We use Tomita grammars [25] to evaluate our approach. These grammars are popular subjects in the evaluation of formal-language-related work on RNNs [26,27,28], as they possess various features while being small enough to facilitate manual analysis. All of the grammars are defined over the input symbols 0 and 1. We transformed the ground-truth DFAs into Mealy machines; thus, the outputs are either \( true \) (string accepted) or \( false \) (string not accepted). Table 1 contains for each Tomita grammar a short description of the accepted strings and the number of states of the smallest Mealy machine accepting the corresponding language. For example, Tomita 5 accepts strings depending on the parity of the numbers of 0 and 1 symbols. The language described by Tomita 5 has been used to illustrate the \(L^*\) algorithm [5]. Automata accepting such languages are hard to encode using certain types of RNNs [29].

4.1.2 Bluetooth Low Energy (BLE)

To evaluate the applicability to practical problems, we consider the BLE protocol. BLE was introduced in the Bluetooth standard 4.0 as a communication protocol for low-energy devices. The BLE protocol stack implementation differs from that of Bluetooth classic. Pferscher and Aichernig [30] learned behavioral models of BLE devices with \(L^*\). They reported practical challenges in the creation of an interface to enable the interaction required by active automata learning. Especially the requirement of adequately resetting the device after each performed query raises the need for a learning technique that requires less interaction with the SUL. We selected three devices from their case study. The selected devices have a similarly large state space and show more stable deterministic behavior than other devices in the case study by Pferscher and Aichernig [30], which would have required advanced data processing to filter out non-deterministic behavior. Table 2 lists the investigated devices, the used System-on-Chip, and the running application. In the following, we refer to the devices by their System-on-Chip names. The running application initially sends BLE advertisements and accepts a connection with another BLE device. If a connection terminates, the device again sends advertisements. The generated behavioral model should formalize the implemented connection procedure. Compared to existing work [30], we extended the nine considered inputs by another input that implements the termination request, which indicates the termination of the connection by one of the two devices. Since every input must be defined for every state, the complexity of learning increases with the size of the input alphabet. Hence, the BLE case study provides a first impression of the scalability of our presented learning technique.

Table 2 Investigated Bluetooth Low Energy (BLE) devices including the running application

Figure 7 depicts a behavioral model of the CYBLE-416045-02. For the illustration, some input and output labels have been simplified and combined by a ‘+’-symbol. The model shows that a connection can be established with a connection request and terminated by a scan or termination request. A version request is only answered once during an active connection. Pferscher and Aichernig [30] provide a link to complete models of all three considered examples.

Fig. 7 Simplified model of the CYBLE-416045-02 (‘\(+\)’ abbreviates inputs/outputs)

4.2 Experimental setup

We demonstrate the effectiveness of our approach on both (i) the canonical Tomita grammars from the literature [26, 28, 31], and (ii) the physical BLE devices introduced in the previous section. Both evaluations (i) and (ii) are performed with AAL data and subsequently with randomly generated data.

We consider the automata learned with the active automata learning (AAL) algorithm \(L^*\) and the corresponding data produced by AAL as given. We call these the AAL automata and AAL data, respectively. In general, we do not require AAL to be executed in advance. AAL rather provides a reference point for the evaluation of our proposed RNN architecture.

Our case study is based on the following four experimental setups: (i) We first evaluate the capability of our RNN architecture to learn the correct automaton when the number of states of the AAL automaton is known in advance. This number of states k is used to set the size of \(s^{<t>}\) in the RNN architecture. The AAL automaton itself is only used as ground truth. It does not affect the RNN training procedure in any way other than defining the size of \(s^{<t>}\). We say that the RNN learned the correct automaton if the automaton extracted from the trained RNN according to Algorithm 3 is isomorphic to the minimal ground-truth automaton.

(ii) We then evaluate the capability of our approach to learn the correct automaton without making any assumption on the number of automaton states other than a data-based upper bound, as shown in Algorithm 4. We fix \(\#runs = 10\) and perform a statistical evaluation by running Algorithm 4 ten times for each use case. For simplicity, we use the same values from (i) for the RNN architecture and training hyperparameters (e.g., number of hidden layers and neurons per layer, activation function, learning rate, regularization factor, etc.), except for the size of the state vector \(s^{<t>}\), which is now controlled by the algorithm itself. In practice, these hyperparameters can be tuned by running Algorithm 4 with different hyperparameter values and selecting those with better convergence properties.

(iii) Furthermore, we evaluate the effects of changing RNN hyperparameters.

(iv) Finally, we compare our proposed RNN-based learning technique with a classic passive learning technique from the literature. To enable a fair comparison, we again use the randomly generated data.

AAL Data. Firstly, we use the AAL data as RNN training data. This finite set of traces from AAL is complete in the sense that passive automata learning could learn a behavioral model with k states that conforms to the model learned by AAL. For AAL data generation, we used the active automata learning library AALpy [6], which implements state-of-the-art algorithms including the \(L^*\)-algorithm variant for Mealy machines by Shahbaz and Groz [22]. The logged data include all performed output queries and the traces generated for conformance testing during the equivalence check. The model-based testing technique used for conformance testing provides state coverage for the intermediate learned hypotheses.

For the BLE data generation, we use a learning framework similar to that of Pferscher and Aichernig [30]. To collect the output queries performed during automata learning, we logged the BLE communication between the learning framework and the SUL. The logged traces are then post-processed to exclude non-deterministic traces. Non-determinism might occur due to packet loss or delayed packets. In this case, the active automata learning framework repeated the output query. To clean up the logged BLE traces, we execute all input traces on the actively learned Mealy machine. If the observed output sequence deviates from the Mealy machine output, the trace is removed from the considered learning dataset.

Random Data. Secondly, we use randomly generated data as training data. That is, we are not guided by any active learning procedure to generate the training data. Instead, we simply sample random inputs from the input alphabet and observe the outputs produced by the system, i.e., the Tomita grammars or the physical BLE devices. This corresponds to a more realistic real-world scenario where the data logged during regular system operation is the only available training data. To speed up the experiments, we use the AAL automaton instead of the real system to generate random data. More precisely, we achieve this through random walks on the AAL automaton. Each random walk represents a trace in the training data. It always starts from the initial state of the AAL automaton and collects the sequence of input–output pairs obtained by running the AAL automaton on the randomly generated inputs. We set a value \(max\_length\) for the maximal length of the generated traces and the number of traces to be generated.

  • For Tomita grammars, in each iteration, we produce a trace through a random walk from the initial state with a length uniformly distributed within \([1, max\_length]\). We add the produced trace to the dataset if it was not already generated before.

  • For the BLE devices, we generate traces that simulate BLE sessions between real-world devices. For this, each trace ends with a terminate request indicating the end of the connection.

    Hence, we can extract such traces from a long random walk by extracting the subtraces between two subsequent terminate requests. Thus, we start a random walk from the initial state. At each step, we sample an input request or force a terminate request to ensure a maximum individual trace length of \(max\_length\).

    Every time we return to the initial state, we add the corresponding generated trace to the dataset, if not already contained. We start a new random walk and iterate as long as the dataset does not contain the desired number of traces.

    Since each episode ends in the initial state due to the final terminate request, we exploit this knowledge during the RNN training by adding to the overall loss (Algorithm 2, Line 9) the term \(cross\_entropy(q_0,s^{<last>})\) corresponding to the deviation of the last RNN state \(s^{<last>}\) from the initial state of the learned automaton, which is fixed to \(q_0\) by construction.

We start with a smaller random dataset and progressively generate bigger random datasets until the RNN learns the correct automaton or a predefined time budget is consumed.
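For the Tomita setting, the random-walk generation might be sketched as follows, replaying inputs on the ground-truth machine (the MealyMachine sketch from Sect. 2.2); all names are ours, and the sketch assumes that enough distinct traces exist to collect the requested number of unique ones.

import random

def random_traces(machine, alphabet, n_traces, max_length, rng=random):
    """Random walks from the initial state of the given (AAL) automaton."""
    traces = set()
    while len(traces) < n_traces:
        length = rng.randint(1, max_length)  # uniform in [1, max_length]
        inputs = [rng.choice(alphabet) for _ in range(length)]
        outputs = machine.run(inputs)        # replay on the ground-truth machine
        traces.add(tuple(zip(inputs, outputs)))  # only keep unique traces
    return [list(t) for t in traces]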

All experiments were performed with PyTorch 1.8. Evaluation (i) was performed on a Dell Latitude 5501 laptop with an Intel Hexa-Core i7-9850H, 32 GB RAM, 2.60 GHz, 512 GB SSD, NVIDIA GeForce MX150, and Windows 10. Due to the increased number of RNN training sessions, evaluations (ii) and (iii) were performed on a scientific cluster based on Intel(R) Xeon(R) Gold 6230R CPUs @ 2.1 GHz running Ubuntu 20.04. Evaluation (iv) was performed on an Apple MacBook Pro 2019 with an Intel Quad-Core i5 running at 2.4 GHz and 8 GB of memory.

4.3 Results and discussion

The following section presents the results for our four experimental setups, followed by a discussion on the generalizability of our presented learning technique to other case studies.

4.3.1 Minimal states number is given

In this evaluation, we assume that the number of states of a minimal automaton is given. This should demonstrate the basic feasibility of our proposed RNN architecture.

Tables 3 and 4 illustrate the experimental results obtained by applying our approach to learn the automata of Tomita grammars and BLE devices, respectively, from both AAL data and random data. Compared to the same tables from our conference paper [15], we present different numbers here since we repeated the experiments under slightly different conditions for the following reasons:

  • We adapted the random data generation procedure to provide exactly the desired number of unique episodes, instead of just removing the duplicates after the episodes were generated and providing the remaining episodes.

  • We enhanced some procedures to support the new Algorithm 4 which is introduced in this article.

The number of traces contained in the training data is given in the column Size for both AAL data and random data. For the random data, it is interesting to know how many traces from the AAL data were also contained in the random data. This information is shown in the column AAL Data Coverage as the ratio between the number of AAL traces contained in the random data and the overall number of traces in the AAL data.

The column Episode Lengths contains the means and standard deviations of the lengths of the traces in the training data. The column RNN contains (i) the RNN architecture parameters which possibly changed across the experiments (i.e., the activation function af and the number of hidden layers #hl in Table 3 and the regularization factor C in Table 4), and (ii) the number of epochs #e and the time t required by the RNN training to learn the correct automaton.

The values of other RNN architecture parameters, which were the same in all experiments, are mentioned in the table captions. For instance, it turned out that the values 0.001 and 256 for the learning rate and the number of neurons per hidden layer, respectively, worked for all considered case studies.

For the Tomita grammars (Table 3), the value of the regularization factor C was also fixed and equal to 0.001, whereas different activation functions and numbers of hidden layers were used across the different grammars. For the BLE devices (Table 4), the number of the hidden layers and the activation function were also fixed and equal to 1 and ReLU, respectively, whereas different regularization factors were used across different devices.

Table 3 RNN automata learning of Tomita grammars
Table 4 RNN automata learning of BLE devices

The results show that we could find an appropriate architecture to learn the correct automata in all considered cases, meaning that the RNN accuracy was 100% in all cases. This was expected when learning from AAL data, as the AAL data were sufficient for the \(L^*\) algorithm to learn the underlying minimal automaton. More surprising is that we could learn the correct automaton for all examples also from relatively small random datasets with a low coverage of the AAL data. Even more surprising is that for all Tomita grammars, except Tomita 2 and 3, we could significantly shrink the random datasets compared to the AAL dataset and still learn correctly. The datasets became smaller in terms of both the number of traces and the average trace length. Moreover, only a small fraction of the AAL data happened to be included also in the random data. This suggests that the proposed RNN architecture and training may generalize better on sparser datasets than AAL. The good performance on Tomita grammars might be attributed to the small number of automaton transitions and the input/output alphabets that only consist of ‘0’ and ‘1’ symbols.

For the BLE device CYBLE-416045-02, which has much larger input and output alphabets, we could still learn the correct automaton from a random dataset containing fewer traces than the AAL data. The other two BLE devices required larger random datasets due to the higher number of transitions to be covered.

For all case studies, except Tomita 6 and 7, the same RNN architecture worked for both AAL and random data. For Tomita 7, ReLU worked for the AAL data, whereas tanh worked better for the random data. For Tomita 6, two hidden layers worked better for the random data, as opposed to a single hidden layer. Typically, the tuning process of RNN hyperparameter values depends on the dataset, and each case study involves a distinct dataset. When learning from random data, we also attempted to learn from datasets as small as possible. In the case of Tomita 6 and 7, this necessitated different values for some hyperparameters compared to learning from AAL data.

4.3.2 Minimal states number is learned

In the following, we present the results of the evaluation for learning the minimal ground-truth automaton from scratch with Algorithm 4, i.e., without knowing the minimal number of states in advance. Below, we denote an automaton that is isomorphic to the minimal ground-truth automaton as the minimal AAL automaton.

We define an ‘experiment’ as the endeavor to learn a minimal automaton by executing Algorithm 4, with the following components: (i) a training dataset, such as the AAL dataset or randomly generated data, related to a Tomita grammar or a BLE device, (ii) an RNN architecture specified by parameters like the number of hidden layers, number of neurons, activation function, learning rate, and regularization factor (C), and (iii) a training budget that includes the maximum number of runs at each iteration of the fixpoint computation, the maximum number of training epochs, and, in the case of random data, the number of episodes to be randomly generated and the maximum allowed length of a generated episode. We say that an experiment is approved if the value of the variable \(approved_{min}\) returned by Algorithm 4 is true, i.e., the last fixpoint iteration returned an automaton isomorphic to the automaton learned in the previous iteration. In general, we follow the ‘best-effort’ strategy if not otherwise stated.

Table 5 Experiments individual results: minimal RNN automata learning of Tomita grammars with #epochs = 100 (strategy = best-effort)

Our minimal automaton learning procedure is inherently stochastic. There are two sources of stochasticity: (i) RNN-related stochasticity (e.g., RNN parameters initialization, samples ordering during training, etc.), and (ii) random generation of training data. To account for the RNN-related stochasticity, we repeat an experiment with the same input ten times. When learning a minimal automaton from random data, the training data are always newly randomly generated in each of the ten experiments.

Table 6 Experiments individual results: minimal RNN automata learning of Tomita 5 and 7 grammars from AAL data with #epochs = 200 (strategy = best-effort)

Tables 5, 6, 9, and 10 show the individual results of all experiments for the Tomita grammars and the BLE devices, respectively. An entry in these tables reports the following information for the corresponding experiment of a Tomita grammar, resp. BLE device:

  • overall result: (i) ✓ if a minimal automaton was learned, (ii) ✗ if an automaton was learned but it is not minimal, (iii) a marker for an unapproved experiment if the fixpoint has not been reached within the given budget, i.e., \(approved_\textrm{min} = \bot \) in Algorithm 4, (iv) (✓) if an automaton with the minimal number of states was learned but it is not the same as the minimal AAL automaton due to incomplete randomly generated data,

  • fixpoint convergence details: N\(\xrightarrow {\#it}\)n, where N is the upper-bound size of the state vector \(s^{<t>}\) (i.e., the maximal number of states that can be learned) initially estimated via the IOPTA computation, n is either the number of states of the learned automaton or n.a. (not applicable) if no automaton could be learned, and \(\#it\) is the number of learning iterations, which equals the number of iterations of the repeat-loop in Algorithm 4,

  • FSM minimization contribution: (i) ‘min(I:i)’: the maximum reduction of states achieved by the FSM minimization approach, as described in Sect. 2.2, during the learning iterations, where I and i (with I > i) are the numbers of states of the automaton extracted from the trained RNN before and after FSM minimization, respectively, (ii) ‘−’ if I = i, i.e., the FSM minimization could not further reduce the number of states of the extracted automaton at any learning iteration, or (iii) ‘n.a.’ if no automaton could be learned,

  • execution time in the format hh:mm:ss, i.e., the hours, minutes, and seconds taken to run the experiment.

Tables 7, 8, 11, and 12 summarize the statistics averaged over the ten experiments performed for each Tomita grammar and BLE device with AAL and random training data. The meaning of the columns is as follows (a short sketch of the aggregation is given after the list):

  1. “Autom. Learned”: the percentage of all experiments in which an automaton could be learned, i.e., the percentage of approved experiments,

  2. “Learned Autom. is min.”: the percentage of approved experiments in which the learned automaton is minimal,

  3. “Minim. Boost”: the percentage of all experiments in which the automaton extracted from the trained RNN at some fixpoint iteration could be reduced by the FSM minimization, thereby reducing the number of fixpoint iterations,

  4. “Initial State Vector Size”: the mean and standard deviation, over all experiments, of the upper bound estimated via the IOPTA computation, which sets the size of the state vector \(s^{<t>}\) in the RNN architecture (Fig. 6) for the first fixpoint iteration,

  5. “#Iterations”: the mean and standard deviation, over the approved experiments, of the number of iterations required to converge,

  6. “Non-min. Autom. #states”: the mean and standard deviation of the number of states of the learned automaton, over all approved experiments in which a non-minimal automaton was learned,

  7. “Learning Time”: the mean and standard deviation, over all approved experiments, of the execution time, both in the format hh:mm:ss.
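As an illustration of how these columns aggregate per-experiment records, consider the following sketch. The record fields are hypothetical, not the actual data layout; the denominators (all experiments versus approved experiments) follow the definitions above.

```python
# Hypothetical aggregation of per-experiment records into the summary columns;
# assumes at least two approved experiments so that stdev is defined.
from statistics import mean, stdev

def summarize(experiments):
    approved = [e for e in experiments if e["approved"]]
    non_min = [e for e in approved if not e["is_minimal"]]

    def pct(part, whole):
        return 100 * len(part) / len(whole)

    return {
        "Autom. Learned": pct(approved, experiments),
        "Learned Autom. is min.": pct([e for e in approved if e["is_minimal"]], approved),
        "Minim. Boost": pct([e for e in experiments if e["minimization_helped"]], experiments),
        "Initial State Vector Size": (mean(e["upper_bound"] for e in experiments),
                                      stdev(e["upper_bound"] for e in experiments)),
        "#Iterations": (mean(e["iterations"] for e in approved),
                        stdev(e["iterations"] for e in approved)),
        "Non-min. Autom. #states": (mean(e["states"] for e in non_min),
                                    stdev(e["states"] for e in non_min)) if len(non_min) > 1 else None,
        "Learning Time": (mean(e["seconds"] for e in approved),
                          stdev(e["seconds"] for e in approved)),
    }
```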

Table 7 Experiments summary: minimal RNN automata learning of Tomita grammars with #epochs = 100 (strategy = best-effort)
Table 8 Experiments summary: minimal RNN automata learning of Tomita 5 and 7 grammars from AAL data with #epochs = 200 (strategy = best-effort)

Tomita grammars. First, a budget of 100 epochs was employed for all Tomita grammars. As Tables 5 and 7 show, this budget was sufficient (indicated by ✓) to learn the minimal AAL automaton for most Tomita grammars in most experiments, except for Tomita 5 and Tomita 7 with AAL training data. For Tomita 5, we have only one approved experiment, in which it was also possible to learn the minimal automaton. In the remaining nine experiments, marked as unapproved, none of the ten runs at the first learning iteration yielded an automaton perfectly explaining the training data, i.e., training did not converge to 100% accuracy within the given budget. This strongly suggests that the maximum of 100 epochs was too low and stopped the RNN training too early. The situation was similar, though less drastic, for Tomita 7, with four unapproved experiments in which no automaton could be learned. We thus increased the maximum number of epochs to 200 and repeated the ten experiments for Tomita 5 and Tomita 7 with the AAL training data. The results reported in Tables 6 and 8 show significant improvements with the increased budget.

All learned automata have the minimal number of states required to model the corresponding ground-truth system. After increasing the maximum number of training epochs for Tomita 5 and 7, very few unapproved experiments remained overall. Remarkably, in 9 out of 16 Tomita training setups, the minimal AAL automaton could be learned in all conducted experiments. Moreover, the number of unapproved experiments can be reduced further, if necessary, by increasing the budget. Importantly, Algorithm 4 directly indicates, through the \(approved_{min}\) variable, whether an unapproved experiment needs to be addressed by increasing the budget.

It is worth noting that for all Tomita grammars except Tomita 3, learning a minimal automaton was more efficient with random data than with AAL data. A randomly generated dataset smaller than the AAL dataset was sufficient to learn the minimal AAL automaton. This suggests that the AAL data, which is optimal for the \(L^*\) algorithm to learn the minimal automaton, is not necessarily optimal for the RNN.

In most cases, the FSM minimization helped speed up the fixpoint convergence by reducing the number of necessary iterations. In all cases (including those without an FSM minimization contribution), except one experiment for Tomita 1, the minimal AAL automaton could be learned in the minimum number of fixpoint iterations, which is two. This means that in most experiments we could learn a minimal automaton within the first fixpoint iteration using a data-based upper bound on the number of states; the second iteration is only required to approve the previous learning result.
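The FSM minimization step referenced here (and in the ‘min(I:i)’ entries) can be realized with standard partition refinement. The following generic sketch illustrates the idea for a complete, deterministic Mealy machine; it is not necessarily the exact implementation used in our tool chain.

```python
# Generic Moore-style partition refinement for a complete, deterministic
# Mealy machine; illustrative only.

def minimize_mealy(states, inputs, delta, lam):
    """delta[(q, a)] -> successor state, lam[(q, a)] -> output symbol."""
    # Initial partition: states producing identical outputs share a block.
    by_output = {}
    for q in states:
        by_output.setdefault(tuple(lam[(q, a)] for a in inputs), set()).add(q)
    partition = list(by_output.values())

    # Refine until successors of block-equivalent states remain block-equivalent.
    changed = True
    while changed:
        changed = False
        block_of = {q: i for i, block in enumerate(partition) for q in block}
        refined = []
        for block in partition:
            groups = {}
            for q in block:
                key = tuple(block_of[delta[(q, a)]] for a in inputs)
                groups.setdefault(key, set()).add(q)
            refined.extend(groups.values())
            changed |= len(groups) > 1
        partition = refined
    return partition  # each block corresponds to one state of the minimal machine
```

In the ‘min(I:i)’ notation, I corresponds to the number of states of the extracted machine and i to the number of blocks returned by such a refinement.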

Finally, the above results were obtained by employing the best-effort strategy, which was sufficient to learn the minimal AAL automaton for all Tomita grammars.

Table 9 Experiments individual results: minimal RNN automata learning of BLE devices with strategy = best-effort (#epochs = 200)
Table 10 Experiments individual results: minimal RNN automata learning of CC2650 device from AAL data with strategy = exhaustive (#epochs=200)
Table 11 Experiments summary: minimal RNN automata learning of BLE devices with strategy = best-effort (#epochs = 200)
Table 12 Experiments summary: minimal RNN automata learning of CC2650 device from AAL data with strategy = exhaustive (#epochs = 200)

BLE devices. We first employed the best-effort strategy for all BLE devices in all experiments. As Tables 9 and 11 show, this was sufficient to learn the minimal automaton in most cases, except for the CC2650 device with AAL data. Remarkably, only two unapproved experiments occurred overall. For the nRF52832 device in particular, all test setups delivered the minimal AAL automaton.

For the CYBLE-416045-02 and CC2650 devices with random data, there were a few experiments in which an automaton with the same number of states as the ground-truth automaton, but with different transitions, was learned. These experiments are indicated by (✓). This suggests that the randomly generated training data did not fully cover the ground-truth automaton. Enlarging the dataset with more randomly generated episodes could resolve this issue.

For the CC2650 device, a non-minimal automaton was sometimes learned, indicated by ✗. Note that the learned automaton still conforms to the provided dataset. Learning a non-minimal automaton was experienced more frequently with AAL data. We explain this phenomenon by the conceptual difference between the AAL algorithm used and our learning technique. The \(L^*\) algorithm incrementally increases the automaton size and stops when no further counterexample to the conformance between the learned automaton and the SUL can be found. In contrast, our algorithm incrementally decreases the maximum allowed number of states of the automaton. Let S be the AAL dataset, \(M_A\) the respective automaton learned by \(L^*\), and \(M_R\) the automaton learned by our RNN-based technique; for the CC2650 device we then observe \(S \subset L(M_A)\) but also \(S \subset L(M_R)\). Since we assume that \(M_A\) correctly represents the SUL, our dataset S misses a trace that shows that \(L(M_A) \ne L(M_R)\). This would be an issue in a generic application scenario where the ground-truth automaton is unknown, since we would learn a non-minimal automaton without knowing that it is not minimal. In practice, we can circumvent this problem by testing conformance between the learned model and the SUL, following a conformance testing approach as in active learning. Moreover, employing the more expensive exhaustive strategy significantly improves the probability of fixing the non-minimality issue, as a comparison of Tables 9 and 10 shows. In fact, with the exhaustive strategy, in only 1 out of 8 approved experiments was the learned automaton not minimal. It is worth noting, however, that whenever a non-minimal automaton was learned, its size did not differ much from the minimal size, e.g., 6 or 7 states of the non-minimal automaton versus 5 states of the minimal automaton.
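Checking whether the dataset misses such a distinguishing trace amounts to an equivalence check between the two Mealy machines. One standard way to realize it is a breadth-first search over the product machine, as in the following sketch; the tuple-based machine encoding is our own illustration, assuming complete transition and output functions.

```python
# BFS over the product of two deterministic Mealy machines, searching for an
# input sequence on which their outputs differ; a generic sketch.
from collections import deque

def find_distinguishing_sequence(m1, m2, inputs):
    """Each machine is (initial, delta, lam) with complete delta/lam dicts."""
    (q1, d1, l1), (q2, d2, l2) = m1, m2
    queue, seen = deque([(q1, q2, ())]), {(q1, q2)}
    while queue:
        s1, s2, prefix = queue.popleft()
        for a in inputs:
            if l1[(s1, a)] != l2[(s2, a)]:
                return prefix + (a,)  # outputs diverge: languages differ
            successor = (d1[(s1, a)], d2[(s2, a)])
            if successor not in seen:
                seen.add(successor)
                queue.append(successor + (prefix + (a,),))
    return None  # no difference reachable: the machines are equivalent
```

Applied to \(M_A\) and \(M_R\), a returned input sequence is exactly a trace missing from S; a result of None would mean \(L(M_A) = L(M_R)\).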

Compared to the Tomita grammars, the upper bounds estimated with the IOPTA computation are much larger. This is due to the larger training datasets for the BLE devices, especially the CC2650 and nRF52832 devices. Nevertheless, the fixpoint convergence is still reached in a low number of iterations, mostly 2 or 3. This is because our RNN architecture, thanks to the design of the regularization, is capable of learning a number of automaton states close to the minimal one even if the size of the state vector \(s^{<t>}\) would allow learning many more states. However, even though such large initial sizes of the state vector \(s^{<t>}\) barely affect the functional capability of learning the minimal automaton, they slow down the training procedure since the RNN size increases considerably. Finding a tighter approximation than the IOPTA computation for the initial size of the state vector \(s^{<t>}\), or simply making a good guess and trying a much lower initial size, could significantly speed up learning the minimal automaton. For example, setting the initial size to 100 would represent a significant reduction (of up to two orders of magnitude) compared to the IOPTA computation, yet remains reasonable as it is much higher than the actual minimal number of states. Specifically, for the nRF52832 device, we found that the learning time of the minimal automaton could be reduced by approximately 21%, from 00:05:22 to 00:04:15, with the AAL data and by around 94%, from 02:59:50 to only 00:11:03, with the random data. However, in this paper we adhere to the IOPTA computation instead of defining ad hoc fixed upper bounds, since the IOPTA provides a theoretical guarantee for the upper bound of the minimal number of states.
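For reference, the IOPTA-based upper bound can be obtained by inserting all traces into a prefix tree and counting its nodes; the following is a minimal sketch of this idea, with the trace representation chosen for illustration.

```python
# Sketch of the IOPTA-based upper bound: every trace is inserted into a prefix
# tree whose node count bounds the number of states needed to fit the data.

def iopta_upper_bound(traces):
    """traces: iterable of episodes, each a sequence of (input, output) pairs."""
    root, count = {}, 1  # the root of the prefix tree counts as one state
    for episode in traces:
        node = root
        for step in episode:  # step = (input, output)
            if step not in node:
                node[step] = {}
                count += 1  # every fresh tree node is one more potential state
            node = node[step]
    return count
```

Since the tree grows with the dataset, the large BLE datasets directly explain the large initial state vector sizes discussed above.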

In summary, whether Algorithm 4 learns a minimal automaton equivalent to the ground truth depends on the quality of the training data and on the complexity of the ground-truth automaton in terms of the number of states and transitions and the input/output alphabet sizes. Unapproved experiments are not critical since they are directly signaled by the algorithm and can be mitigated by increasing the epoch budget. If a stronger guarantee of automaton minimality is required, the more expensive exhaustive strategy should be used.

Table 13 Change of RNN hyperparameters for Tomita 6 and Tomita 7
Table 14 Change of RNN hyperparameters for Tomita 6 and Tomita 7
Table 15 Change of RNN hyperparameters for Tomita 6 and Tomita 7

4.3.3 Effects of RNN hyperparameter changes

In the following, we illustrate how varying hyperparameter values can influence the results of our RNN-based learning technique.

The hyperparameter values given in Tables 3 and 4 were determined through a tuning process that we continued until we were satisfied with the results. When dealing with AAL data, we explored different hyperparameter values until we could learn the minimal automaton in a reasonable timeframe. For random data, we additionally varied the size of the dataset, prioritizing learning from smaller datasets. The hyperparameter tuning process was repeated for each dataset size, starting from a size of 10 and gradually increasing it until the automaton could be learned successfully. This explains why we obtained different hyperparameter values for random data than for AAL data in Table 3 for the Tomita 6 and Tomita 7 grammars. To discuss the influence of the hyperparameters, we examine how the learning results change when altering the number of hidden layers and the activation function for Tomita 6 and Tomita 7, respectively. Tables 13, 14, and 15 compare the results of the new RNN architecture with those of the initial RNN architecture, using the same seeds to ensure identical starting conditions for the non-deterministic parts of training.

Tomita 6. In Table 3 for Tomita 6 with random data, we could successfully learn the minimal automaton from a random dataset containing as few as 20 traces when increasing the number of hidden layers from 1 to 2. In contrast, with a single hidden layer, as used for the AAL data, a random dataset of 50 traces would have been needed to learn the minimal automaton. In the following, we therefore check whether learning with two hidden layers also works for the AAL data.

As can be seen in Table 13, when the minimal number of states is given, we could still successfully learn the minimal AAL automaton, but the execution time almost doubles due to the additional hidden layer, while the number of training epochs remains almost constant. When the minimal number of states must also be learned, the learning time increases significantly for most experiments (see Table 15), especially for two experiments where convergence could not be reached (see Table 14). However, the RNN's standalone performance improved: FSM minimization helped in only 60% of the experiments, as opposed to 100% with the initial RNN architecture. This improvement can be attributed to the increased capacity of the RNN due to the additional hidden layer. Although there were two experiments in which no convergence was reached at the first iteration, the overall convergence success rate of 80% remained satisfactory, and there was no need to repeat the experiments with an increased budget.

Tomita 7. In Table 3 for Tomita 7 with random data and the minimal number of states given, we successfully learned the minimal automaton from a random dataset containing as few as 50 traces when switching from the ReLU activation function, which worked first with the AAL data, to tanh. Attempting to learn the automaton from the AAL data with the tanh activation function did not complete within a set time budget of 10 min, an order of magnitude more than the learning time achieved with ReLU. In the following, we analyze the effects of using the ReLU activation function, as with the AAL data, instead of tanh when learning the minimal automaton from random data. Table 13 reveals an increase in the required random dataset size from 50 to 80 traces, which, in turn, explains the increased AAL data coverage and learning time. When the minimal number of states must also be learned (see Tables 14 and 15), the convergence rate improves (90% instead of 80%, see Table 15), as does the RNN's performance relative to external FSM minimization (a minimization boost of 0% instead of 40%), both attributable to the larger training dataset. The larger training dataset is also responsible for the increased average learning time.

4.3.4 Comparison to a classic learning algorithm

We compared our RNN-based learning technique with a conventional passive learning algorithm from the literature. We chose a variant of the Regular Positive Negative Inference (RPNI) algorithm [4, 18] for learning Mealy machines. RPNI is a passive learning algorithm supported by many modern learning libraries such as LearnLib [7] and AALpy [6]. The RPNI variant starts by building an IOPTA from the given dataset, similar to our Algorithm 4. Then the algorithm merges states until no more merges are possible; two states are merged if their future input/output behavior is equivalent with respect to the dataset. Originally, RPNI requires positive and negative episodes, where positive episodes are included in the language to be learned and negative ones are not. Negative traces are not required for learning Mealy machines, as states can be distinguished by their different output behavior for the same input sequences. For consistency with our active-learning-based data generation, we use the RPNI implementation of AALpy for the performed comparison. Note that the chosen implementation merges states deterministically.
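A hedged usage sketch of this comparison setup is shown below. run_RPNI is AALpy's passive learning entry point; the exact shape of the training data for Mealy machines is our assumption and should be checked against AALpy's documentation.

```python
from aalpy.learning_algs import run_RPNI

# Each sample pairs an input sequence with its output sequence; this layout is
# our assumption for automaton_type='mealy' and should be verified against
# AALpy's documentation.
data = [
    (('a', 'b', 'b'), ('0', '1', '0')),
    (('b', 'a'), ('1', '0')),
]

learned_model = run_RPNI(data, automaton_type='mealy')  # deterministic merging
print(learned_model)  # textual dump of the learned Mealy machine
```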

Table 16 RPNI automata learning given the same random datasets used in the RNN-based experiments presented in Tables 5 and 9

Table 16 presents the results of running RPNI on the same randomly generated datasets as used for the RNN-based experiments in Tables 5 and 9. The numbers of episodes correspond to the numbers given in Tables 3 and 4. For comparison, we measure the conformance between the ground-truth automaton and the learned automaton by checking bisimilarity. The conformance metric gives the average percentage of common edges over the union of the edges of both automata. In addition, we report the average number of states of the learned automata, the percentage of automata that have the same number of states as the minimal ground-truth automaton, and the required execution time of the learning algorithm.
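One plausible realization of this edge-based conformance metric is sketched below. It aligns the states of both machines via a canonical breadth-first numbering before intersecting the edge sets; this alignment is our simplifying assumption and may differ from the exact bisimilarity-based computation.

```python
# Sketch of the edge-overlap conformance metric: canonically number the states
# of each machine by BFS from the initial state, then take the Jaccard index
# of the resulting edge sets (as a percentage).
from collections import deque

def canonical_edges(initial, delta, lam, inputs):
    ids, queue, edges = {initial: 0}, deque([initial]), set()
    while queue:
        q = queue.popleft()
        for a in sorted(inputs):
            nxt = delta[(q, a)]
            if nxt not in ids:
                ids[nxt] = len(ids)
                queue.append(nxt)
            edges.add((ids[q], a, lam[(q, a)], ids[nxt]))
    return edges

def conformance(m1, m2, inputs):
    """Each machine is (initial, delta, lam); returns the % of common edges."""
    e1, e2 = (canonical_edges(*m, inputs) for m in (m1, m2))
    return 100 * len(e1 & e2) / len(e1 | e2)
```

For identical machines the metric is 100%; every edge present in only one of the machines lowers it toward 0%.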

Remark

Owing to space constraints, the conformance metric values for the RNN-based experiments were not included in the tables presented in the earlier sections. To facilitate a more comprehensive comparison with the RPNI results in Table 16, Table 17 provides the average and standard deviation of the conformance values per use case. These values correspond to the experiments conducted with the random data as outlined in Tables 5 and 9. Note that experiments in which the convergence point was not reached, i.e., unapproved experiments, have been excluded.

Table 17 Conformance results from the RNN-based experiments with the random data presented in Tables 5 and 9

The RPNI results indicate differences between the Tomita and the BLE case studies. For the Tomita case study, RPNI achieves quite high conformance scores, with only a few exceptions where a slightly larger automaton is learned. This indicates that the generated random datasets sufficiently cover the behavior of the underlying Tomita grammars. The results for the BLE case study differ: the automata learned with RPNI achieve significantly lower conformance with the ground-truth automata, and RPNI failed to learn the minimal automaton in the majority of experiments. These results emphasize that randomly generated datasets might not be complete, meaning that not every input and output is defined for every state. For the BLE case study, sparsity in the dataset is expected due to the larger input and output alphabets. This shows that our RNN-based technique generalizes well on incomplete datasets and learns more conforming automata than RPNI. Regarding execution time, we observe that state merging requires less time than training an RNN.

For the case studies considered in this paper, a comparison of Table 16 with Tables 7 and 11 indicates that our RNN-based learning approach is preferable to a traditional passive automata learning algorithm like RPNI when utilizing random data or data collected during the regular operation of the SUL. If reliable online interaction with the SUL is possible, it is advisable to opt for a classical active automata learning algorithm such as \(L^*\) to effectively learn the minimal automaton of the SUL.

4.3.5 Generalizability

We evaluated the feasibility of learning minimal FSMs of relatively small size from samples of Tomita regular languages and of BLE devices. It remains to be shown whether our results generalize to more complex languages that require larger FSMs. However, our main goal was to propose an architecture that can accurately capture discrete dynamics. The languages from our case studies possess various features, such as modular arithmetic in Tomita 5. Hence, we argue that the architecture generalizes well to different languages of comparable discrete complexity. While interesting, scaling to large discrete systems is not our foremost goal. Since RNNs are generally well suited to tasks like time-series forecasting, we rather plan to extend RNNs with our architecture to additionally model continuous dynamics. Thus, we aim to pave the way toward end-to-end learning of hybrid system models.

5 Related work

In this research, we address the challenge of learning an FSM from a given sample by employing constrained training of an RNN. The literature has presented various techniques for extracting an FSM from sample data, including state-merging-based approaches [4, 18, 32, 33], search-based techniques [34,35,36,37], and methods relying on SAT [9,10,11] and SMT [12, 13] solving. Similar to state-merging methods, we construct an IOPTA, with the distinction that we do not merge its states. The IOPTA solely establishes an upper bound on the maximum number of states of the ultimately learned FSM. Our findings in Sect. 4 demonstrate that our RNN-based method can learn more accurate FSMs for systems with more than two inputs and outputs compared to a well-established state-merging technique [18, 32].

Early work on the relationship between finite automata and neural networks examined the capacity of neural networks to simulate the behavior outlined by a finite automaton. Pioneering this investigation, Kleene [38] was among the first to demonstrate the suitability of neural networks for such simulations. Subsequently, Minsky [39] provided a comprehensive construction of neural networks capable of simulating finite automata. In contrast, we do not simulate a known automaton, but rather we learn an FSM from a sample of a regular language.

The relationship between neural networks and automata has also been exploited to explain the behavior of neural networks, by extracting automata from trained neural networks. The process of extracting an automaton from a neural network is commonly referred to as knowledge distillation [40]. The literature comprises various techniques for knowledge distillation: Giles et al. [41] demonstrated the extraction of deterministic finite automata (DFAs) from a particular type of RNNs trained on regular languages, a method later refined by Omlin and Giles [42] to be applicable to a broader range of RNN architectures. In a more general context, Wang et al. [43] conducted a benchmark study on different RNN architectures for DFA extraction. Furthermore, Wang et al. [44] presented an empirical evaluation, validating the reliability of the approach introduced by Giles et al. [41]. The basis for their approach is that hidden states of RNNs form clusters; thus, automata states can be identified by determining such clusters. This property was recently also used to learn DFAs [26]. Tiňo and Šajda [45] used self-organizing maps to identify clusters for modeling Mealy automata. Michalenko et al. [46] empirically analyzed this property and found that there is a correspondence between hidden-state clusters and automata states, but some clusters may overlap, i.e., some states may be indistinguishable. Hong et al. [47] and Dong et al. [31] utilized clustering to find an adequate state abstraction and subsequently apply a state-merging algorithm on the abstracted traces to learn DFAs and Markov chains, respectively. The use of these techniques relies on the assumption that the nodes of neural networks form distinct clusters.

In contrast to relying on clustering, which may not be perfect, we enforce a clustering of the hidden states through regularization. Closest to our work in this regard is the work by Oliva and Lago-Fernández [27]. They enforce neurons with sigmoid activation to operate in a binary regime, thus leading to very dense clusters, by introducing Gaussian noise prior to applying the activation function during training. The identified clusters correspond to states of the FSM, where the number of clusters is not necessarily minimal. Similar to our approach, they apply minimization algorithms to the extracted automaton. However, a key distinction from our technique lies in the fact that the definition of state transitions relies on the inference of clusters in their approach, whereas our proposed RNN architecture directly predicts the next states.

An additional method for knowledge distillation of neural networks involves treating the trained neural network as an oracle that can be queried. Using an active learning approach, the learning algorithm constructs a behavioral model based on the responses obtained from these queries. Several approaches have been recently proposed based on or related to the \(L^*\) algorithm [5]. Weiss et al. proposed automata-learning-based approaches to extract DFAs [26], weighted automata [48], and subsets of context-free grammars from RNNs [49]. Mayr and Yovine [50] applied the \(L^*\) algorithm to extract automata from RNNs, where they provide probabilistic guarantees. Khmelnitsky et al. [51] propose a property-directed verification approach for RNNs. They use the \(L^*\) algorithm to learn automata models from RNNs and analyze these models through model checking. Muškardin et al. [28] examine the effect of different equivalence-query implementations in \(L^*\)-based learning of models from RNNs. Barbot et al. [52] use an \(A^*\)-based technique to extract push-down automata from RNNs simulating context-free grammars. In addition to \(L^*\)-related approaches, there are also query-based techniques utilizing Hankel matrices [53]: Both Eyraud and Ayache [54] and Lacroce et al. [55] create weighted automata by populating a Hankel matrix through queries to an RNN. Motivated by the recent popularity of knowledge distillation, the TAYSIR competition [56] has been initiated, with the aim of creating models that provide simpler representations of trained RNNs and transformers. Muškardin et al. [57] emerged as the winners of the competition, employing a learning-based testing approach following their earlier work [28] and utilizing the automata learning library AALpy [6]. It is important to note that these active techniques focus on extracting an automaton from a trained neural network, whereas our approach involves a constrained training technique for RNNs to extract automata from given samples.

Another approach to extract an FSM from a trained neural network is to additionally train an autoencoder that encodes a neural network as a finite state representation. Koul et al. [58] introduce quantization through training of quantized bottleneck networks into RNNs that encode policies of autonomous agents. This allows them to extract FSMs in order to understand the memory usage of recurrent policies. Carr et al. [59] use quantized bottleneck networks to extract finite-state controllers from recurrent policies to enable formal verification.

6 Conclusion

In this work, we presented a new machine learning technique for learning minimal finite-state models in the form of Mealy machines. Our automata learning approach builds upon a specialized RNN architecture together with a constrained training method in order to construct a fixed-size Mealy machine from given training data. Starting from an upper bound on the number of states, the approach iteratively creates models of decreasing size. This iterative process enables learning of minimal models. In common with classical passive automata learning methods, like RPNI [4] and ALERGIA [32], we start from a tree-shaped representation of the training data. However, instead of explicit state merging, we employ RNN training to search for a smaller representation of the data.

We evaluated our method on example grammars from the literature as well as on a Bluetooth protocol implementation. In almost all cases, we were able to learn a minimal automaton correctly representing the ground truth. Where this was not the case, it was due to missing training data; nevertheless, the learned Mealy machine was correct with respect to the training data. A clear advantage compared to our previous work is that the user does not need to know the number of states in advance.

We see the encouraging results as a step toward learning more complex models comprising discrete and continuous behavior, as found in many control applications. Control applications commonly have only a few modes (discrete states) but may possess complex continuous behavior. For this reason, we focus on small automata in this work. As a next step, we see the integration of continuous behavior into our models as the most promising avenue for future work. That is, we plan to learn the discrete behavior with the regularized training described in this article while learning continuous behavior through conventional RNN training. Having finite-state models of hybrid systems will especially help the explainability and interpretability of decisions of hybrid system controllers. We leave these investigations for future work. We will also apply our approach to case studies with larger numbers of states. However, for such cases, other automata learning techniques that focus on discrete behavior may be more suitable.

Finally, we dare to express the hope that this work contributes to bridging the gap between the research communities in machine learning and automata learning, ultimately leading to more trustworthy AI systems.