1 Introduction

Fig. 1 Automata learning framework based on training a recurrent neural network (RNN) using a given set of traces. To learn a minimal automaton, we adapt the structure of the RNN iteratively

Models are at the heart of any engineering discipline. They capture the necessary abstractions to master the complexity in a systematic design and development process. In software engineering, models are used for a variety of tasks, including specification, design, code-generation, verification, and testing. In formal methods, these models are given formal mathematical semantics to reach the highest assurance levels. This is achieved through (automated) deduction, i.e., the reasoning about specific properties of a general model.

With the advent of machine learning, there has been a growing interest in the induction of models, i.e., the learning of formal models from data. We have seen techniques to learn deterministic and non-deterministic finite state machines, Mealy machines, timed automata, and Markov decision processes. In this research, called automata learning [1], model learning [2], or model inference [3], specific algorithms have been developed that either start from given data (passive learning) [4] or actively query a system during learning (active learning) [5]. Two prominent libraries that implement such algorithms are AALpy [6] (implemented in Python) and LearnLib (implemented in Java) [7].

An alternative to specific algorithms is to map the automata learning problem to another domain. For example, it was shown that the learning problem can be encoded as a SAT [8,9,10,11] or SMT [12, 13] problem; it is then the task of the respective solver to find a model consistent with the given data.

In this work, we explore the question of whether machine learning can be harnessed for automata learning. That is, we investigate if and how the problem of automata learning can be mapped to a machine learning architecture. Our results show that a specific recurrent neural network (RNN) architecture is able to learn a Mealy machine from given data. Specifically, we approach the classic NP-complete problem of inducing an automaton with at most k states that is consistent with a finite sample of a regular language [14]. Figure 1 depicts the basic procedure of the presented RNN-based learning technique. Given a set of traces from a black-box system, we train an RNN from which we extract an automaton that models the behavior of the system.

The main contributions of this work can be summarized as follows: (i) a novel architecture for automata learning by enhancing classical RNNs, (ii) a specific constrained training approach exploiting regularization, (iii) a systematic evaluation with standard grammatical inference problems and a real-world case study, and (iv) evidence that we can find an appropriate architecture to learn the correct automata in all considered cases.

This is an extended version of a conference paper presented at SEFM 2022, the 20th International Conference on Software Engineering and Formal Methods [15]. The new contributions comprise (i) a generalized algorithm that does not require the number of states of the automaton to be known in advance, but constructs an automaton with a minimal number of states, and (ii) the corresponding new evaluation.

This new generalization of our learning technique is non-trivial. Figure 1 illustrates the iterative procedure of our extension. The main idea is as follows: First, we determine an upper bound on the number of states necessary to capture the data in a Mealy machine. For this, we build a tree out of the given data traces and count its nodes. Then, we initialize an RNN with a size proportional to this upper bound. During training, we search for an automaton representation that captures the data. Once an automaton is found, we check with a standard minimization algorithm if a smaller automaton exists. If the minimized automaton has fewer states than previously found automata, we train an RNN with this smaller target number of states. This is repeated until we find a minimal automaton that is consistent with the training data. To sum up, we start with a tree-shaped model without any generalization and end with a minimal automaton. Our evaluation demonstrates that this is indeed possible.

The rest of the paper is structured as follows. Section 2 introduces preliminary work. In Sect. 3, we present our automata learning technique based on RNNs. Section 4 discusses the results of the conducted case studies. We compare to related work in Sect. 5, followed by concluding remarks in Sect. 6.

2 Preliminaries

2.1 Recurrent neural networks

Recurrent neural networks (RNNs) are a popular choice for modeling sequential data, such as time-series data [16]. The classical version of an RNN with feedback from a hidden layer to itself is known as vanilla RNN [17].

A vanilla recurrent neural network with input x and output y is defined as

$$\begin{aligned} h^{<t>}&= f\left( W_{hx} x^{<t>} + W_{hh} h^{<t-1>} + b_h\right) \\ \hat{y}^{<t>}&= g\left( W_y h^{<t>} + b_y\right) \end{aligned}$$

where f and g are activation functions for the recurrent and the output layer, respectively. Popular activation functions for the recurrent layer are rectified linear unit (ReLU) and hyperbolic tangent (tanh), whereas the softmax or hardmax functions may be used for g when categorical output values shall be predicted. The activation functions for the output values are defined as

$$\begin{aligned} \hbox {softmax}(z)[i] = \frac{e^{z[i]}}{\sum \nolimits _{n=1}^{N} e^{z[n]}}, \end{aligned}$$

and

$$\begin{aligned} \hbox {hardmax}(z)[i]= {\left\{ \begin{array}{ll} 1, &{} \text {if}\quad z[i] = \max (z)\\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$

for \(i \in 1 \ldots N\) and \(z = (z[1],\ldots ,z[N]) \in \mathbb {R}^N\), where \(\hbox {softmax}(z)\) provides a probability distribution over the values of vector z and \(\hbox {hardmax}(z)\) assigns the probability of one to the index in z with the highest value. The parameters, also known as weights, \(\Theta = (W_{hx}, W_{hh}, b_h, W_y, b_y)\) need to be learned. The input to the network at time step t is \(x^{<t>}\), whereas \(\hat{y}^{<t>}\) is the corresponding prediction of the network. \(h^{<t>}\) is referred to as the hidden state of the network and is used to access information from past time steps or, equivalently, to pass relevant information from the current time step to future steps.
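For illustration, a single time step of such a vanilla RNN can be sketched in NumPy as follows; all names, shapes, and the choice of tanh are illustrative, not taken from our implementation.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift by max(z) for numerical stability
    return e / e.sum()

def hardmax(z):
    out = np.zeros_like(z)
    out[np.argmax(z)] = 1.0  # probability one for the index with the highest value
    return out

def rnn_step(x_t, h_prev, W_hx, W_hh, b_h, W_y, b_y):
    # h^<t> = f(W_hx x^<t> + W_hh h^<t-1> + b_h) with f = tanh
    h_t = np.tanh(W_hx @ x_t + W_hh @ h_prev + b_h)
    # y^<t> = g(W_y h^<t> + b_y) with g = softmax
    y_t = softmax(W_y @ h_t + b_y)
    return h_t, y_t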

An RNN maps an input sequence \(\textbf{x}\) to an output sequence \({\hat{\textbf{y}}}\) of the same length. It is trained based on training data \(\{(\textbf{x}_1,\textbf{y}_1), \ldots , (\textbf{x}_m,\textbf{y}_m)\}\) containing m sequence pairs. While processing input sequences \(\textbf{x}_i = (x_i^{<1>}, \ldots , x_i^{<n>})\), values of the parameters \(\Theta \) are learned to minimize the error between the true outputs \(\textbf{y}_i = (y_i^{<1>}, \ldots , y_i^{<n>})\) and the network’s predictions \((\hat{y}^{<1>}_i, \ldots , \hat{y}^{<n>}_i)\).

The error is measured through a predefined loss function. The most popular loss functions are the mean squared error for real-valued \(y^{<t>}\), and the cross-entropy loss for categorical \(y^{<t>}\).

Gradient-based methods are used to minimize the error by iteratively changing each weight in proportion to the derivative of the error with respect to that weight, until the error falls below a predefined threshold or a fixed number of iterations has been performed.
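For categorical outputs, one such gradient step might look as follows in PyTorch; this is a minimal sketch, and the function names and optimizer choice are our own illustration.

import torch
import torch.nn.functional as F

def gradient_step(model, optimizer, x, y_true):
    loss = F.cross_entropy(model(x), y_true)  # cross-entropy for categorical outputs
    optimizer.zero_grad()
    loss.backward()   # derivative of the error w.r.t. each weight
    optimizer.step()  # adjust each weight in proportion to its gradient
    return loss.item()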

2.2 Finite state machines

We consider finite-state machines (FSMs) in the form of Mealy machines:

Definition 1

A Mealy machine is a 6-tuple \(\langle Q, q_0, I,O,\delta ,\lambda \rangle \) where

  • Q is a finite set of states containing the initial state \(q_0\),

  • I and O are finite sets of input and output symbols,

  • \(\delta : Q \times I \rightarrow Q\) is the state transition function, and

  • \(\lambda : Q \times I \rightarrow O\) is the output function.

Starting from a fixed initial state, a Mealy machine \(\mathcal {M}\) responds to inputs \(i \in I\) by changing its state according to \(\delta \) and producing outputs \(o \in O\) according to \(\lambda \). Given a sequence of inputs \(\textbf{i} \in I^*\), \(\mathcal {M}\) produces an output sequence \(\textbf{o} = \lambda ^*(q_0, \textbf{i})\), where \(\lambda ^*(q, \epsilon ) = \epsilon \) for the empty sequence \(\epsilon \) and \(\lambda ^*(q, i \cdot \textbf{i}) = \lambda (q,i) \cdot \lambda ^*(\delta (q,i), \textbf{i})\), where i is an input, \(\textbf{i}\) is an input sequence, and \(\cdot \) denotes concatenation. Given input and output sequences \(\textbf{i}\) and \(\textbf{o}\) of the same length, we use \(t(\textbf{i},\textbf{o})\) to create a sequence of input–output pairs in \((I \times O)^*\). We call such a sequence of pairs a trace.
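A Mealy machine and the lifted output function \(\lambda^*\) can be sketched directly from Definition 1; the following minimal Python sketch uses a dictionary-based representation, which is our own choice.

class MealyMachine:
    def __init__(self, q0, delta, lam):
        self.q0 = q0        # initial state
        self.delta = delta  # transition function: dict (state, input) -> state
        self.lam = lam      # output function: dict (state, input) -> output

    def run(self, inputs):
        """Compute the output sequence o = lambda*(q0, i) for an input sequence i."""
        q, outputs = self.q0, []
        for i in inputs:
            outputs.append(self.lam[(q, i)])  # emit lambda(q, i)
            q = self.delta[(q, i)]            # move to delta(q, i)
        return outputs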

A Mealy machine \(\mathcal {M}\) defines a regular language over \(I \times O\): \(L(\mathcal {M}) = \{t(\textbf{i}, \textbf{o}) \mid \textbf{i} \in I^*, \textbf{o} = \lambda ^*(q_0, \textbf{i})\} \subseteq (I \times O)^*\). The language contains the deterministic response to any input sequence and excludes all other sequences. We can now formalize the problem that we tackle in this paper: Given a finite set of traces \(S \subset (I \times O)^*\), we learn a Mealy machine \(\mathcal {M}\) with at most n states such that \(S \subseteq L(\mathcal {M})\), by training an RNN. This is a classic NP-complete problem in grammatical inference [14]. Usually, it is stated for deterministic finite automata (DFAs), but any DFA can be represented by a Mealy machine with \( true \) and \( false \) as outputs, denoting whether a word (input sequence) is accepted.

Fig. 2 Mealy machine of a ping server

Example 1

(Model of Ping Server) Figure 2 shows a Mealy machine of a simple ping server that responds to pings after a connection has been established. The model has three states that are connected with transitions labeled by pairs of inputs and outputs. For example, from the initial state \(q_0\), the server responds with \( ConnAck \) to the input \( Connect \). Any further \( Connect \) input leads to a closing of the connection with the corresponding output observation \( ConnectionClosed \).

Next, we introduce auxiliary techniques that are related to automata learning. With the first, we compute a bound on the number of FSM states that are sufficient for a Mealy machine to produce a set of given traces. The second technique, FSM minimization, computes from a Mealy machine \(\mathcal {M}\) a Mealy machine \(\mathcal {M}'\) such that \(\mathcal {M}'\) has the minimal number of states and its language is equivalent to that of \(\mathcal {M}\).

2.2.1 Bounding FSM size

Let \(S \subset (I \times O)^*\) be a finite set of traces and let \(\ll \) be the reflexive prefix relation on traces. To compute an upper bound on the number of FSM states sufficient to produce S, we create a prefix-tree acceptor (PTA) from S and use its number of nodes as the bound. PTA creation is a common preprocessing step in automata learning algorithms [18], like RPNI [4]. We create input–output prefix tree acceptors (IOPTAs), a variation of PTAs similar to those used by IoAlergia [19]. An IOPTA T is a tree that compactly represents a trace sample S, with edges labeled by inputs and nodes, except the root, labeled by outputs. Hence, a path from the root to a node of T is labeled by a trace in \((I \times O)^*\).

An IOPTA T created for a trace sample S contains a path labeled by a trace t iff S contains a trace \(t'\) with \(t\ll t'\), i.e., T contains a path for every trace prefix. Thus, T can be created from S by merging traces with common prefixes. Such an IOPTA T is a partial Mealy machine that defines exactly the prefix-closure of S. We can deduce that the number of nodes of T is a bound on the number of FSM states sufficient to represent S. Since languages defined by a Mealy machine are prefix-closed, IOPTA computation does not introduce any generalization.
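The IOPTA construction and the resulting state bound can be sketched as follows, assuming traces are given as sequences of input–output pairs; the class and function names are ours.

class IOPTANode:
    def __init__(self, output=None):
        self.output = output  # output labeling the node (None for the root)
        self.children = {}    # input symbol -> child IOPTANode

def build_iopta(traces):
    """Merge traces (sequences of (input, output) pairs) by common prefixes."""
    root = IOPTANode()
    for trace in traces:
        node = root
        for i, o in trace:
            if i not in node.children:
                node.children[i] = IOPTANode(o)
            node = node.children[i]
    return root

def state_bound(node):
    """Number of nodes of the IOPTA = bound on the number of FSM states."""
    return 1 + sum(state_bound(c) for c in node.children.values())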

Fig. 3 IOPTA representing traces sampled from the ping server

Example 2

(IOPTA of ping server) Suppose we sampled the following set of three traces:

  • \( Connect / ConnAck \cdot Connect / ConnectionClosed \cdot Connect / ConnAck \)

  • \( Connect / ConnAck \cdot Ping / Pong \cdot Ping / Pong \)

  • \( Ping / ConnectionClosed \cdot Connect / ConnAck \cdot Ping / Pong \)

The corresponding IOPTA is shown in Fig. 3, where outputs are put in curly braces to distinguish them from inputs. The IOPTA has nine nodes; thus, we know that nine states are sufficient to model the ping server.

2.2.2 FSM minimization

Minimization of FSMs basically partitions the states Q of a Mealy machine \(\mathcal {M}\) into blocks B that are equivalent w.r.t. \(\lambda ^*\); see Hopcroft et al. [20] for the minimization of DFAs. That is, two states are grouped into a block if they produce the same outputs for all input sequences, and thus cannot be distinguished. Given such a partition B, a minimal Mealy machine \(\mathcal {M}'\) can be constructed with states B, i.e., states given by blocks of indistinguishable states. A transition between \(b \in B\) and \(b' \in B\) is created if there is a corresponding transition between \(r\in b\) and \(s \in b'\) in \(\mathcal {M}\). Note that \(\mathcal {M}'\) is unique up to a renaming of states. Active automata learning algorithms, like \(L^*\), have minimality as an inherent property, whereas we apply minimization as an additional step. Efficient algorithms, such as Hopcroft's FSM minimization algorithm [21], have an \(n \log n \) worst-case runtime complexity. Hence, the runtime overhead of the minimization step is negligible.
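The partitioning into blocks of indistinguishable states can be sketched as follows; for simplicity, this is a naive quadratic partition refinement rather than Hopcroft's \(n \log n\) algorithm, and all names are ours.

def minimize_blocks(states, inputs, delta, lam):
    """Return the coarsest partition of states into blocks that agree on lambda*."""
    def signature(q, block_of):
        # outputs and successor blocks of q for every input
        return tuple((lam[(q, i)], block_of[delta[(q, i)]]) for i in inputs)

    partition = [set(states)]  # start with a single block
    while True:
        block_of = {q: idx for idx, block in enumerate(partition) for q in block}
        refined = []
        for block in partition:
            groups = {}
            for q in block:  # split states with differing signatures
                groups.setdefault(signature(q, block_of), set()).add(q)
            refined.extend(groups.values())
        if len(refined) == len(partition):
            return partition  # stable: blocks of indistinguishable states
        partition = refined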

Fig. 4 Minimal Mealy machine of a ping server

Example 3

(Minimization of Ping Model) The model shown in Fig. 2 is non-minimal. The states \(q_1\) and \(q_2\) are equivalent, as there is no sequence that distinguishes them. Hence, a minimization would create partition \(\{\{q_0\},\{q_1,q_2\}\}\). Based on that, we can create the minimal Mealy machine shown in Fig. 4.

2.3 Automata learning

Automata learning creates behavioral FSMs of black-box systems. Figure 5 illustrates the general framework for learning a reactive system model in the form of a Mealy machine. The goal of automata learning is to create a model \(\mathcal {M}\) such that \(L(\mathcal {M}) = L(\mathcal {M}_\textrm{SUL})\), where \(\mathcal {M}_\textrm{SUL}\) is an unknown Mealy machine representing the System Under Learning (SUL).

Fig. 5 The automata learning framework generates a Mealy machine from a sample of traces. The sample is generated from the executions of inputs on the reactive system

We distinguish between active and passive learning algorithms. Passive learning creates a behavioral model from a given set of traces: it infers from a finite set of traces \(S \subset (I \times O)^*\) a Mealy machine \(\mathcal {M}_P\) such that \(S \subseteq L(\mathcal {M}_P)\), often restricting \(\mathcal {M}_P\) to have at most k states. Given that \(S \subseteq L(\mathcal {M}_\textrm{SUL})\), most algorithms guarantee \(L(\mathcal {M}_P) = L(\mathcal {M}_\textrm{SUL})\) for large enough S and finite \(\mathcal {M}_\textrm{SUL}\) [18]. One challenge in the application of passive learning is to provide a finite set of traces such that \(L(\mathcal {M}_P) = L(\mathcal {M}_\textrm{SUL})\).

Active automata learning queries the SUL to create a behavioral model. Many active learning algorithms are based on the \(L^*\) algorithm [5], which is defined for different modeling formalisms like Mealy machines [22]. \(L^*\) queries the SUL to generate a finite set of traces \(S \subset (I \times O)^*\) from which a hypothesis Mealy machine \(\mathcal {M}_A\) is constructed that fulfills \(S \subseteq L(\mathcal {M}_A)\). \(L^*\) guarantees that \(\mathcal {M}_A\) is minimal. The hypothesis \(\mathcal {M}_A\) is then checked for equivalence to the language \(L(\mathcal {M}_\textrm{SUL})\). Since \(\mathcal {M}_\textrm{SUL}\) is unknown, checking the behavioral equivalence between \(\mathcal {M}_\textrm{SUL}\) and \(\mathcal {M}_A\) is generally undecidable. Hence, conformance testing is used to substitute the equivalence oracle in active learning. Model-based testing techniques generate a finite set of traces \(S_\mathcal {T} \subset (I \times O)^*\) from executions on \(\mathcal {M}_A\) and check if \(S_\mathcal {T} \subset L(\mathcal {M}_\textrm{SUL})\). If \(t(\textbf{i}, \textbf{o}) \notin L(\mathcal {M}_\textrm{SUL})\), a counterexample to the behavioral equivalence between \(\mathcal {M}_\textrm{SUL}\) and \(\mathcal {M}_A\) is found. Based on this trace, the set of traces S is extended by performing further queries. Again, a hypothesis \(\mathcal {M}_\textrm{A}\) is created and checked for equivalence. This procedure repeats until no counterexample to the equivalence between \(L(\mathcal {M}_\textrm{SUL})\) and \(L(\mathcal {M}_\textrm{A})\) can be found. The algorithm then returns the learned automaton \(\mathcal {M}_\textrm{A}\). Note that \(L^*\) creates \(\mathcal {M}_\textrm{A}\) such that \(S \subset L(\mathcal {M}_\textrm{A})\). With access to a perfect behavioral equivalence check, which provides any differences between the languages defined by \(\mathcal {M}_\textrm{SUL}\) and \(\mathcal {M}_A\), we could guarantee that the generated finite set of traces S enables learning a model \(\mathcal {M}_A\) such that \(L(\mathcal {M}_A) = L(\mathcal {M}_\textrm{SUL})\).

3 Automata learning with RNNs

In this section, we first present the problem that we tackle and propose an RNN architecture as a solution. After that, we cover (i) the constrained training of the proposed RNN architecture with our specific regularization term, and (ii) the usage of the trained RNN to extract an appropriate automaton. Finally, we propose an iterative learning algorithm that uses the proposed RNN architecture and automaton extraction to learn a minimal automaton without knowing the minimal number of states.

3.1 Overview and architecture

It is well known that recurrent neural networks (RNNs) can be used to efficiently model time-series data, such as data generated from the interaction with a Mealy machine. Concretely, this can be done by using the machine inputs \(x^{<t>}\) as inputs to the RNN and minimizing the difference between the machine’s true outputs \(y^{<t>}\) and the RNN’s predictions \(\hat{y}^{<t>}\). In other words, the RNN would predict the language \(L(\mathcal {M})\) of a Mealy machine \(\mathcal {M}\).

Fig. 6 RNN-cell architecture

This optimization process can be performed via gradient descent. Even if such a trained RNN can model all interactions with perfect accuracy, one disadvantage compared to a native automaton representation, e.g., a Mealy machine, is that it is much less interpretable. While each state in a Mealy machine can be identified by a discrete number, the hidden state of the RNN, which is the information passed from one time step to the next, is a continuous real-valued vector. This vector may be needlessly large and contain mostly redundant information. Thus, it would be useful if we could simplify such a trained RNN into a Mealy machine \(\mathcal {M}_{R}\) that produces the language of the Mealy machine \(\mathcal {M}\) we want to learn, i.e., with \(L(\mathcal {M}) = L(\mathcal {M}_R)\).

We approach the following problem. Given a sample \(S \subset L(\mathcal {M})\) of traces \(t(\textbf{i}_j,\textbf{o}_j)\) and the number of states k of \(\mathcal {M}\), we train an RNN to correctly predict \(\textbf{o}_j\) from \(\textbf{i}_j\). To facilitate interpretation, we want to extract a Mealy machine \(\mathcal {M}_R\) from the trained RNN with at most k states, modeling the same language. For \(\mathcal {M}_R\), \(S \subset L(\mathcal {M}_R)\) shall hold such that for large enough S we have \(L(\mathcal {M}) = L(\mathcal {M}_R)\).

For this purpose, we propose an RNN architecture and learning procedure that ensure that the RNN hidden states can be cleanly translated into k discrete automata states. Compared to standard vanilla RNNs, the hidden states are transformed into an estimate of a categorical distribution over the k possible automaton states. This restricts the encoding of information in the hidden states since now all components need to be in the range [0, 1] and sum up to 1. Figure 6 shows our complete RNN cell architecture for a single hidden layer, implementing the following equations.

$$\begin{aligned} h^{<t>}&= af\left( W_{hx} x^{<t>} + W_{hs} s^{<t-1>} + b_h\right) , \quad af \in \{\hbox {ReLU}, \hbox {tanh}\} \quad (1)\\ \hat{y}^{<t>}&= \hbox {softmax}\left( W_y h^{<t>} + b_y\right) \quad (2)\\ \hat{s}^{<t>}&= \hbox {softmax}\left( W_s h^{<t>} + b_s\right) \quad (3)\\ s^{<t>}&= {\left\{ \begin{array}{ll} \hbox {softmax}\left( W_s h^{<t>} + b_s\right) &{} \text {if mode} = ``{}\texttt {train}\hbox {''}\\ \hbox {hardmax}\left( W_s h^{<t>} + b_s\right) &{} \text {if mode} = ``{}\texttt {infer}\hbox {''} \end{array}\right. } \quad (4) \end{aligned}$$

In comparison with vanilla RNN cells, the complete hidden state \(h^{<t>}\) is only an intermediate vector of values. Based on \(h^{<t>}\), an output \(\hat{y}^{<t>}\) is predicted using a softmax activation. A Mealy machine state \(\hat{s}^{<t>}\) is predicted as well and passed to the next time step. It is computed via (i) softmax during RNN training, and (ii) hardmax during inference. During training, see Algorithm 2, we also compute the cross-entropy of \(\hat{s}^{<t>}\) with \(hardmax(\hat{s}^{<t>})\) as a label, which serves as a regularization term. Inference refers to extracting an automaton from the trained RNN, which takes as input the current system state and an input symbol and gives as output the next system state and an output symbol. Hence, we use softmax to estimate a categorical distribution over possible states during training, whereas we use hardmax to concretely infer one state during inference.
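Equations 1–4 can be sketched as a PyTorch module as follows; the layer and parameter names are ours, and ReLU is fixed as the activation function for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AutomatonRNNCell(nn.Module):
    """Sketch of the cell in Fig. 6, implementing Eqs. 1-4 for one time step."""

    def __init__(self, n_inputs, n_outputs, n_states, n_hidden):
        super().__init__()
        self.ih = nn.Linear(n_inputs + n_states, n_hidden)  # W_hx, W_hs, b_h
        self.out = nn.Linear(n_hidden, n_outputs)           # W_y, b_y
        self.state = nn.Linear(n_hidden, n_states)          # W_s, b_s

    def forward(self, x_t, s_prev, mode="train"):
        h_t = torch.relu(self.ih(torch.cat([x_t, s_prev], dim=-1)))  # Eq. 1
        y_logits = self.out(h_t)          # Eq. 2 (softmax applied in the loss)
        s_soft = F.softmax(self.state(h_t), dim=-1)                  # Eq. 3
        if mode == "train":
            s_t = s_soft                  # Eq. 4: softmax during training
        else:                             # "infer": hardmax as a one-hot vector
            s_t = F.one_hot(s_soft.argmax(dim=-1), s_soft.size(-1)).float()
        return y_logits, s_soft, s_t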

Our algorithm for extracting a Mealy machine from a trained RNN, see Algorithm 3, is based on the idea that if the RNN achieves perfect accuracy when predicting the machine’s true outputs, the hidden state \(h^{<t>}\) encodes information corresponding to the state of a Mealy machine at time step t. Otherwise, the RNN would not be able to predict the expected outputs correctly, since those are a function of both the input and the current state. By adapting the RNN architecture, we enforce hidden states to correspond to discrete Mealy machine states.

3.1.1 Multiple hidden layers

The matrices \(W_{hx}\), \(W_{hs}\), \(W_y\), and \(W_s\) and the corresponding bias vectors introduced before define the weights of an RNN with a single layer. It may be beneficial to add additional hidden layers to better predict certain complex behaviors. For the hidden layers, we use fully connected layers, where each layer i is defined by a pair \({W_{hh}}_i\) and \({b_{h}}_i\) containing the weights of that layer. Each layer performs an additional transformation of the hidden state \(h^{<t>}\).

Concretely, the one-hot-encoded input \(x^{<t>}\) and state \(s^{<t-1>}\) are first mapped to \(h^{<t>}\), that is, Eq. 1 is left unchanged. Let \(h^{<t>}_0 = h^{<t>}\), then every additional layer performs the transformation \(h^{<t>}_i = af({W_{hh}}_i h^{<t>}_{i-1} + {b_{h}}_i)\). When we have multiple layers, we perform the state and output prediction on the result \(h^{<t>}_k\) from the last hidden layer k, that is, we substitute \(h^{<t>}\) by \(h^{<t>}_k\) in Eqs. 2, 3 and 4. Hence, input processing is carried out only by the first layer, and state and output predictions, along with their corresponding regularization, are performed exclusively in the last layer. All transformations in between are not affected by regularization.
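A minimal sketch of such a stack of additional fully connected layers, again with ReLU and with names of our own choosing:

import torch.nn as nn

def hidden_stack(n_hidden, n_layers):
    """Extra fully connected layers h_i = af(W_hh_i h_{i-1} + b_h_i), af = ReLU."""
    layers = []
    for _ in range(n_layers):
        layers += [nn.Linear(n_hidden, n_hidden), nn.ReLU()]
    return nn.Sequential(*layers)  # applied to h^<t> before Eqs. 2-4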

3.2 Training and automaton extraction

In the following, we first discuss how to train an RNN with the structure shown in Fig. 6 such that it will encode an automaton. Secondly, we show how to extract the automaton from a trained RNN. We start by illustrating the basic operation of such an RNN, i.e., the prediction of outputs and state transitions from an input sequence. This is called the forward pass and is used during training and automaton extraction.

Forward pass. Algorithm 1 implements the forward pass, taking an input sequence \(\textbf{x}\) and a mode variable as parameters. The mode variable distinguishes between training (train) and automaton extraction (infer). The algorithm returns a pair \((\hat{\textbf{y}}, \hat{\textbf{s}})\) comprising the predicted output sequence and the sequence of hidden states visited by the forward pass. We want to learn the language of a Mealy machine, i.e., map \(\textbf{i} \in I^*\) to \(\textbf{o} \in O^*\) for sets IO of input and output symbols. Therefore, we encode every \(i \in I\) using a one-hot encoding to yield input sequences \(\textbf{x}\) from \(\textbf{i} \in I^*\). In this encoding, every i is associated with a unique \(\vert I\vert \)-dimensional vector, where exactly one element is equal to one and all others are zero. We write x for a one-hot-encoded input i. Analogously, we encode outputs, and the hidden state shall approach a one-hot encoding in a k-dimensional vector space. For one-hot-encoded outputs, we generally use the letter y, and we use D to denote one-hot-encoded training datasets derived from a sample \(S \subset L(\mathcal {M})\).

Algorithm 1 Model forward pass M(x, mode)

Algorithm 1 initializes the output and state sequences \(\hat{\textbf{y}}\) and \(\hat{\textbf{s}}\) to the empty sequences and the hidden state s of the RNN to the one-hot encoding of the fixed initial state \(q_0\). For every input symbol \(x^{<t>}\), Lines 4–12 perform the equations defining the RNN, i.e., applying affine transformation using weights and an activation function. At each step t, we compute and store the predicted output \(\hat{y}^{<t>}\) (Line 6) and the predicted state \(\hat{s}^{<t>}\) (Line 7) in \(\hat{\textbf{y}}\) and \(\hat{\textbf{s}}\), respectively. In the “train” mode, we pass \(\hat{s}^{<t>}\) as hidden state to the next time step (Line 9). In the “infer” mode used for automaton extraction, we apply a hardmax on the hidden state (Line 11) so that exactly one state is predicted.
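Assuming the AutomatonRNNCell sketched in Sect. 3.1, the forward pass of Algorithm 1 might be implemented as follows; this is a sketch in which sequences are represented as lists of one-hot tensors.

import torch
import torch.nn.functional as F

def forward_pass(cell, x_seq, mode="train"):
    """Run the cell over a list of one-hot input tensors (cf. Algorithm 1)."""
    n_states = cell.state.out_features
    s = F.one_hot(torch.tensor(0), n_states).float()  # fixed initial state q0
    y_hat, s_hat = [], []
    for x_t in x_seq:
        # one cell step; the state s is passed on to the next time step
        y_logits, _, s = cell(x_t, s, mode)
        y_hat.append(y_logits)  # predicted output (Line 6)
        s_hat.append(s)         # predicted state (Line 7)
    return y_hat, s_hat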

Training. The architecture is trained by minimizing a prediction loss between \(y^{<t>}\) and \(\hat{y}^{<t>}\) along with a regularization loss: the cross-entropy of the state distribution \(s^{<t>}\) w.r.t. the state with the highest probability in \(s^{<t>}\). On the one hand, our regularization design encourages the RNN to re-use a state at subsequent steps once it has been selected at the current step, since this contributes to decreasing the regularization loss. On the other hand, it encourages using as few states as possible overall, since any additional used state contributes to increasing the regularization loss. Minimizing our regularization of choice forces the RNN to increase the certainty about the predicted state. This ensures that the hidden states tend to be approximately one-hot-encoded vectors, where the index of the maximal component corresponds to the state of a Mealy machine accepting the same language. Note that directly using a discrete state representation is not beneficial when training with gradient descent. Algorithm 2 implements the training in PyTorch-like [23] style. Its parameters are the training dataset D, a sample of the language to be learned, the number of learning epochs, and a regularization factor, which controls the influence of state regularization. The training is performed using the gradient-descent-based Adam optimizer [24]. The algorithm performs up to \(\#epochs\) loops over the training data. An epoch processes each trace in the training data, referred to as an episode (Lines 4–21). For the actual training, we perform a forward pass in “train” mode and compute the overall loss from the prediction and state regularization losses (Lines 5–9). Lines 10 and 11 update the RNN parameters, i.e., the weights.

Training stops when the prediction accuracy of the RNN operated as an automaton reaches 100% or \(\#epochs\) epochs have been performed. To calculate the accuracy, we perform a forward pass in the “infer” mode in Line 15 and compute the average accuracy in Line 16. Upon finishing the training, Algorithm 2 returns a Boolean variable indicating if the prediction accuracy converged to \(100\%\) and the trained RNN.
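A corresponding training loop in the spirit of Algorithm 2 might look as follows; this sketch builds on the forward_pass above, uses illustrative hyperparameter values, omits batching, and returns the convergence flag together with the model as in Algorithm 2.

import torch
import torch.nn.functional as F

def accuracy(cell, dataset):
    """Fraction of outputs predicted correctly with the RNN run in 'infer' mode."""
    correct = total = 0
    for x_seq, y_seq in dataset:
        y_hat, _ = forward_pass(cell, x_seq, mode="infer")
        for y_logits, y_t in zip(y_hat, y_seq):
            correct += int(y_logits.argmax().item() == y_t.argmax().item())
            total += 1
    return correct / total

def train(cell, dataset, epochs=100, reg_factor=0.001, lr=0.001):
    """Sketch of Algorithm 2; dataset holds pairs of one-hot sequences (x, y)."""
    opt = torch.optim.Adam(cell.parameters(), lr=lr)
    for _ in range(epochs):
        for x_seq, y_seq in dataset:  # one episode per trace
            y_hat, s_hat = forward_pass(cell, x_seq, mode="train")
            loss = torch.tensor(0.0)
            for y_logits, y_t, s_t in zip(y_hat, y_seq, s_hat):
                # prediction loss between the true output and the prediction
                loss = loss + F.cross_entropy(y_logits.unsqueeze(0),
                                              y_t.argmax().unsqueeze(0))
                # regularization: cross-entropy of s^<t> with hardmax(s^<t>) as label
                loss = loss + reg_factor * F.nll_loss(
                    s_t.log().unsqueeze(0), s_t.argmax().unsqueeze(0))
            opt.zero_grad()
            loss.backward()
            opt.step()
        if accuracy(cell, dataset) == 1.0:  # stop once the automaton is exact
            return True, cell
    return False, cell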

Algorithm 2 RNN training train(M, D)

The purpose of the trained RNN model is not to predict outputs of new inputs unseen during training, but to help with inferring an automaton that produces the training data. This automaton shall be used to predict the outputs corresponding to (new) inputs. Thus, we use all available data for training the RNN and aim at achieving perfect accuracy on the training data. Perfect accuracy on the training set gives us confidence that the internal state representation of the learned RNN model corresponds to the true (partial) automaton that produced the data. In cases where the training data does not cover all states and transitions of the full true automaton, we might learn a partial automaton missing some states and transitions. Using all available data for training reduces the risk of learning just a partial automaton.

Automaton extraction from a trained RNN. Given a trained RNN model, we extract the corresponding automaton with Algorithm 3. We represent a Mealy machine by its set of transitions in the following form:

$$\begin{aligned} T&= \{(s, s', i/o) \mid s, s' \in Q \wedge i \in I \wedge o \in O \wedge \\&\quad \delta (s, i) = s' \wedge \lambda (s, i) = o \}. \end{aligned}$$
Algorithm 3 Automaton extraction from RNN extract(M, D)

Algorithm 3 starts by initializing T to the empty set. Then, it iterates through all episodes, i.e., all traces, from the training set D. At each iteration (Lines 3–13), it first runs the RNN model M on the one-hot-encoded input sequence \(\textbf{x}\) of the current episode (Line 3) to obtain the corresponding predicted output symbols and transition state sequence \(\hat{\textbf{y}}\) and \(\hat{\textbf{s}}\), respectively. For this purpose, we perform the forward pass implemented by Algorithm 1 in the “infer” mode. This mode uses the \( hardmax \) operation to compute the encoded state in each step to ensure stability of the extraction process. A well-trained RNN with our architecture will traverse states that are close to being one-hot-encoded. The states are generally not perfectly one-hot-encoded due to numerical imprecision and the nature of RNN training. Using \( hardmax \) to compute the state suppresses the accumulation of such small imprecisions. This is especially relevant when processing long training episodes during automaton extraction. Using the \( softmax \) operation, we might get ambiguous extraction results, which manifest as non-deterministic transitions.

Lines 6–13 iterate through all steps of the current episode. All episodes start from the initial state \(q_0\) which, by construction, is assigned the label 0. Thus, we initialize the first state to 0 (Line 4). If the predicted output symbol matches the label at the current step (Line 8), then T is extended by a triple encoding a transition, which is built from the source/target states and the input/output symbols of the current step. By applying argmax to the one-hot-encoded input \(x^{<t>}\) and output \(y^{<t>}\), we get integer-valued discrete representations of them (Line 7). The actual input and output symbols are obtained from the input and output alphabets, respectively, through an appropriate indexed mapping. For simplicity, we do not show this mapping here. If the predicted output does not match the expected value, the current and remaining steps of the current episode are ignored and the algorithm moves to the next episode (Line 2). An episode consists of a sequence of adjacent steps (or transitions) in the automaton, i.e., the next step starts from the state where the current step ended (Line 13). After processing all training data traces, Algorithm 3 returns the extracted automaton with transitions T. Note that the extracted automaton might not include all states that can be one-hot encoded with a vector of the length of \(s^{<t>}\). Hence, the number of states of the extracted automaton might be smaller than the length of \(s^{<t>}\).
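Building on the same helpers, the extraction of Algorithm 3 might be sketched as follows; transitions are stored with integer-valued symbols, and the mapping back to the actual alphabets is omitted as above.

def extract(cell, dataset):
    """Sketch of Algorithm 3: collect transitions (s, s', i/o) in 'infer' mode."""
    transitions = set()
    for x_seq, y_seq in dataset:
        y_hat, s_hat = forward_pass(cell, x_seq, mode="infer")
        src = 0  # all episodes start in the initial state, labeled 0
        for x_t, y_t, y_logits, s_t in zip(x_seq, y_seq, y_hat, s_hat):
            if y_logits.argmax().item() != y_t.argmax().item():
                break  # misprediction: skip the rest of this episode
            dst = s_t.argmax().item()  # hardmax state as an integer label
            transitions.add((src, dst,
                             (x_t.argmax().item(), y_t.argmax().item())))
            src = dst  # the next step starts where this one ended
    return transitions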

3.3 Minimal automaton learning

In the following, we explain how the previously proposed algorithms can be used to learn a minimal automaton. In this case, we assume that we do not know in advance the exact number of states of a minimal automaton. Using an iterative approach, we successively lower the upper bound on the number of states allowed when learning an automaton that correctly models the behavior in the given dataset. This is repeated until the number of states cannot be reduced any further.

Algorithm 4 describes our iterative learning approach. For learning, we require a dataset D, which is a sample of the language to be learned, and a specific learning strategy. We distinguish between two learning strategies: best effort (“bestEffort”) and exhaustive (“exhaustive”). In the best-effort strategy, a learning iteration (Lines 5–22) ends upon the first successful attempt, if any, to learn an automaton which perfectly explains the training data (Line 19). In contrast, the exhaustive strategy performs all \(\#runs\) attempts to learn an automaton consistent with the dataset and selects among them the minimal one, if any, for further processing. Consequently, the exhaustive strategy tries, within the given budget of \(\#runs\), to find an even smaller automaton under the known upper bound of the current learning iteration, even if an automaton with the current upper bound has already been found. That is, the exhaustive strategy always executes all \(\#runs\) iterations of the for-loop (Line 7), whereas the best-effort strategy executes at most \(\#runs\). Algorithm 4 returns a triplet consisting of (i) the best learned automaton, (ii) a Boolean variable confirming that, at the last learning iteration, an automaton of the same size has been learned as in the previous iteration, meaning that no smaller automaton could be learned, and (iii) the number of performed learning iterations. The best learned automaton is an automaton with the fewest states learned over all iterations. The Boolean variable indicates whether the RNN model learned the same smallest automaton after setting the upper bound to the minimal number of states found so far.

Algorithm 4 Minimal automaton learning fixpoint(D, strategy)

Algorithm 4 starts by creating an IOPTA from the given dataset D, as described in Sect. 2.2, and initializing the number of learning iterations \(\#it\) to 0. The generated tree represents the initial automaton and provides the first upper bound on the number of states, based on the number of nodes in the tree. We then start our iterative learning procedure. A learning iteration represents one iteration of the repeat-loop and includes the block from Lines 5–22. We terminate the learning procedure if the learned automaton cannot be further minimized.

Let \(states_n(A)\) be a function that returns the number of states of a given FSM A. In Line 5, we save in \(states_{\textrm{min}}\) the number of states of the currently best learned automaton \(A_{\textrm{min}}\), i.e., the automaton with the fewest states learned so far. In the next step, we initialize a Boolean variable \(approved_{\textrm{min}}\) to false, represented by \(\bot \). This variable indicates whether the RNN model learns again an automaton with a number of states equal to \(states_{\textrm{min}}\) after setting the upper bound of states to \(states_{\textrm{min}}\) (Line 8) at the current iteration. In the following (Lines 7–22), we attempt to further reduce the number of states of the automaton \(A_{\textrm{min}}\) with our RNN-based learning technique.

The attempt to minimize the automaton is limited to a maximum number of runs. In Line 8, we initialize our RNN model M based on the current upper bound on the number of states. In terms of the RNN cell architecture we introduced in Sect. 3.1, the given number of states defines the size of the state vector \(s^{<t>}\). We then train the RNN model M on the provided dataset as described in Algorithm 2 (Line 9). Hence, we obtain the trained RNN model M and a Boolean variable indicating whether the trained model achieved \(100\%\) accuracy in predicting the outputs of the provided dataset D. If the RNN model converges to \(100\%\) accuracy, in Lines 11–20 we (i) extract the automaton, (ii) check the automaton size, and (iii) stop trying to learn further automata with potentially fewer states if the best-effort strategy was selected.

(i) First, we extract the automaton A as described in Algorithm 3. Since we do not know the exact number of states, we may learn an automaton that has more states than the minimal automaton representing the dataset D. The extracted automaton might contain states that cannot be distinguished. As described in Sect. 2.2, we can further minimize an automaton by grouping indistinguishable states. For this, we merge all indistinguishable states of the extracted automaton A and create an equivalent automaton containing only distinguishable states in Line 12. Note that any kind of minimization on A does not affect the behavior learned by the RNN model.

(ii) We then compare the size of the minimized automaton A and the size of the current minimal automaton \(A_{\textrm{min}}\) (Line 13). If the newly extracted automaton A has fewer states than the previously learned minimal automaton \(A_{\textrm{min}}\), A becomes the new minimal automaton. In the case that we cannot further reduce the automaton size, we set \(approved_{\textrm{min}}\) to true, represented by \(\top \), indicating that the fixpoint has been reached (Line 16).

Table 1 Description of Tomita grammars

(iii) If we use the best-effort strategy, we do not execute the remaining runs of the current learning iteration after learning the first automaton perfectly explaining the training data (Line 19). In the exhaustive strategy, on the other hand, we continue the current learning iteration until the entire budget \(\#runs\) is consumed and train new RNN models with the minimum number of states from the previous learning iteration as upper bound for the size of the state vector \(s^{<t>}\). After the maximum number of runs has been performed, \(A_{\textrm{min}}\) contains the best among all automata, if any, learned during the \(\#runs\) attempts to learn an automaton perfectly explaining the training data.

Finally, Algorithm 4 terminates if, at the current learning iteration, either (i) the fixpoint is reached, i.e., an automaton with the same number of states as in the previous learning iteration is learned, or (ii) no automaton perfectly explaining the training data can be learned at all due to insufficient budget. The algorithm returns the best learned automaton, if any, otherwise IOPTA(D), along with the information whether the fixpoint has been reached and the number of performed learning iterations.
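Putting the pieces together, the fixpoint iteration of Algorithm 4 might be sketched as follows; this builds on the helpers sketched earlier, minimize_states and num_states stand in for the FSM minimization of Sect. 2.2.2, and the sample is passed both in symbolic form (traces) and one-hot form (dataset) to elide the encoding conversion.

def learn_minimal(traces, dataset, n_inputs, n_outputs, n_hidden=256,
                  runs=10, strategy="bestEffort"):
    """Sketch of Algorithm 4 (minimal automaton learning by fixpoint iteration)."""
    a_min, states_min = None, state_bound(build_iopta(traces))  # IOPTA bound
    iterations = 0
    while True:
        iterations += 1
        learned_this_iteration, approved = False, False
        for _ in range(runs):
            cell = AutomatonRNNCell(n_inputs, n_outputs, states_min, n_hidden)
            converged, cell = train(cell, dataset)
            if not converged:       # no automaton found in this run
                continue
            learned_this_iteration = True
            a = minimize_states(extract(cell, dataset))  # merge equivalent states
            if a_min is None or num_states(a) < num_states(a_min):
                a_min = a           # strictly smaller automaton found
            else:
                approved = True     # same size again: fixpoint candidate
            if strategy == "bestEffort":
                break               # first success ends this learning iteration
        if approved or not learned_this_iteration:
            return a_min, approved, iterations  # fixpoint or budget exhausted
        states_min = num_states(a_min)          # lower the bound and iterate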

4 Case studies

4.1 Case study subjects

4.1.1 Tomita grammars

We use Tomita grammars [25] to evaluate our approach. These grammars are popular subjects in the evaluation of formal-language-related work on RNNs [26,27,28], as they possess various features while being small enough to facilitate manual analysis. All of the grammars are defined over the input symbols 0 and 1. We transformed the ground-truth DFAs into Mealy machines; thus, the outputs are either \( true \) (string accepted) or \( false \) (string not accepted). Table 1 contains for each Tomita grammar a short description of the accepted strings and the number of states of the smallest Mealy machine accepting the corresponding language. For example, Tomita 5 accepts strings depending on the parity of the numbers of 0 and 1 symbols. The language described by Tomita 5 has been used to illustrate the \(L^*\) algorithm [5]. Automata accepting such languages are hard to encode using certain types of RNNs [29].

4.1.2 Bluetooth Low Energy (BLE)

To evaluate the applicability to practical problems, we consider the BLE protocol. BLE was introduced in the Bluetooth standard 4.0 as a communication protocol for low-energy devices. The BLE protocol stack implementation differs from that of Bluetooth classic. Pferscher and Aichernig [30] learned behavioral models of BLE devices with \(L^*\). They reported practical challenges in the creation of an interface to enable the interaction required by active automata learning. Especially the requirement of adequately resetting the device after each performed query raises the need for a learning technique that requires less interaction with the SUL. We selected three devices from their case study. The selected devices have a similarly large state space and show more stable deterministic behavior than other devices in the case study by Pferscher and Aichernig [30], which would have required advanced data processing to filter out non-deterministic behavior. Table 2 lists the investigated devices, the used System-on-Chip, and the running application. In the following, we refer to the devices by their System-on-Chip names. The running application initially sends BLE advertisements and accepts a connection with another BLE device. If a connection terminates, the device again sends advertisements. The generated behavioral model should formalize the implemented connection procedure. Compared to existing work [30], we extended the nine considered inputs by another input that implements the termination request, which indicates the termination of the connection by one of the two devices. Since every input must be defined for every state, the complexity of learning increases with the size of the input alphabet. Hence, the BLE case study provides a first impression of the scalability of our presented learning technique.

Table 2 Investigated Bluetooth Low Energy (BLE) devices including the running application

Figure 7 depicts a behavioral model of the CYBLE-416045-02. For the illustration, some input and output labels have been simplified and combined by a ‘+’-symbol. The model shows that a connection can be established with a connection request and terminated by a scan or termination request. A version request is only answered once during an active connection. Pferscher and Aichernig [30] provide a link to complete models of all three considered examples.

Fig. 7 Simplified model of the CYBLE-416045-02 (‘\(+\)’ abbreviates inputs/outputs)

4.2 Experimental setup

We demonstrate the effectiveness of our approach on both (i) the canonical Tomita grammars from the literature [26, 28, 31], and (ii) the physical BLE devices introduced in the previous section. Both evaluations (i) and (ii) are performed with AAL data and subsequently with randomly generated data.

We consider the automata learned with the active automata learning (AAL) algorithm \(L^*\) and the corresponding data produced by AAL as given. We call these the AAL automata and AAL data, respectively. In general, we do not require AAL to be executed in advance. AAL rather provides a reference point for the evaluation of our proposed RNN architecture.

Our case study is based on the following four experimental setups: (i) We first evaluate the capability of our RNN architecture to learn the correct automaton when the number of states of the AAL automaton is known in advance. This number of states k is used to set the size of \(s^{<t>}\) in the RNN architecture. The AAL automaton itself is only used as ground truth. It does not affect the RNN training procedure in any way other than defining the size of \(s^{<t>}\). We say that the RNN learned the correct automaton if the automaton extracted from the trained RNN according to Algorithm 3 is isomorphic to the minimal ground-truth automaton.

(ii) We then evaluate the capability of our approach to learn the correct automaton without making any assumption on the number of automaton states other than a data-based upper bound, as shown in Algorithm 4. We fix \(\#runs = 10\) and perform a statistical evaluation by running Algorithm 4 ten times for each use case. For simplicity, we use the same values from (i) for the RNN architecture and training hyperparameters (e.g., number of hidden layers and neurons per layer, activation function, learning rate, regularization factor, etc.), except for the size of the state vector \(s^{<t>}\), which is now controlled by the algorithm itself. In practice, these hyperparameters can be tuned by running Algorithm 4 with different hyperparameter values and selecting those with better convergence properties.

(iii) Furthermore, we evaluate the effects of changing RNN hyperparameters.

(iv) Finally, we compare our proposed RNN-based learning technique with a classic passive learning technique from the literature. To enable a fair comparison, we again use the randomly generated data.

AAL Data. Firstly, we use the AAL data as RNN training data. This finite set of traces from AAL is complete in the sense that passive automata learning could learn a behavioral model with k states that conforms to the model learned by AAL. For AAL data generation, we used the active automata learning library AALpy [6], which implements state-of-the-art algorithms including the \(L^*\)-algorithm variant for Mealy machines by Shahbaz and Groz [22]. The logged data include all performed output queries and the traces generated for conformance testing during the equivalence check. The model-based testing technique used for conformance testing provides state coverage for the intermediate learned hypotheses.

For the BLE data generation, we use a learning framework similar to that of Pferscher and Aichernig [30]. To collect the output queries performed during automata learning, we logged the BLE communication between the learning framework and the SUL. The logged traces are then post-processed to exclude non-deterministic traces. Non-determinism might occur due to packet loss or delayed packets. In this case, the active automata learning framework repeated the output query. To clean up the logged BLE traces, we execute all input traces on the actively learned Mealy machine. If the observed output sequence deviates from the Mealy machine output, the trace is removed from the considered learning dataset.

Random Data. Secondly, we use randomly generated data as training data. That is, we are not guided by any active learning procedure to generate the training data. Instead, we simply sample random inputs from the input alphabet and observe the outputs produced by the system, i.e., the Tomita grammars or the physical BLE devices. This corresponds to a more realistic real-world scenario where the data logged during regular system operation is the only available training data. To speed up the experiments, we use the AAL automaton instead of the real system to generate random data. More precisely, we achieve this through random walks on the AAL automaton. Each random walk represents a trace in the training data. It always starts from the initial state of the AAL automaton and collects the sequence of input–output pairs obtained by running the AAL automaton on the randomly generated inputs. We set a value \(max\_length\) for the maximal length of the generated traces and the number of traces to be generated.

  • For Tomita grammars, in each iteration, we produce a trace through a random walk from the initial state with a length uniformly distributed within \([1, max\_length]\). We add the produced trace to the dataset if it was not already generated before.

  • For the BLE devices, we generate traces that simulate BLE sessions between real-world devices. For this, each trace ends with a terminate request indicating the end of the connection.

    Hence, we can extract such traces from a long random walk by extracting the subtraces between two subsequent terminate requests. Thus, we start a random walk from the initial state. At each step, we sample an input request or force a terminate request to ensure a maximum individual trace length of \(max\_length\).

    Every time we return to the initial state, we add the corresponding generated trace to the dataset, if not already contained. We start a new random walk and iterate as long as the dataset does not contain the desired number of traces.

    Since each episode ends in the initial state due to the final terminate request, we exploit this knowledge during the RNN training by adding to the overall loss (Algorithm 2, Line 9) the term \(cross\_entropy(q_0,s^{<last>})\) corresponding to the deviation of the last RNN state \(s^{<last>}\) from the initial state of the learned automaton, which is fixed to \(q_0\) by construction.

We start with a smaller random dataset and progressively generate bigger random datasets until the RNN learns the correct automaton or a predefined time budget is consumed.
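For the Tomita setting, the random-walk generation might be sketched as follows, replaying inputs on the ground-truth machine (the MealyMachine sketch from Sect. 2.2); all names are ours, and the sketch assumes that enough distinct traces exist to collect the requested number of unique ones.

import random

def random_traces(machine, alphabet, n_traces, max_length, rng=random):
    """Random walks from the initial state of the given (AAL) automaton."""
    traces = set()
    while len(traces) < n_traces:
        length = rng.randint(1, max_length)  # uniform in [1, max_length]
        inputs = [rng.choice(alphabet) for _ in range(length)]
        outputs = machine.run(inputs)        # replay on the ground-truth machine
        traces.add(tuple(zip(inputs, outputs)))  # only keep unique traces
    return [list(t) for t in traces]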

All experiments were performed with PyTorch 1.8. Evaluation (i) was performed on a Dell Latitude 5501 laptop with an Intel Hexa-Core i7-9850H, 32 GB RAM, 2.60 GHz, 512 GB SSD, NVIDIA GeForce MX150, and Windows 10. Due to the increased number of RNN training sessions, evaluations (ii) and (iii) were performed on a scientific cluster based on Intel(R) Xeon(R) Gold 6230R CPUs @ 2.1 GHz running Ubuntu 20.04. Evaluation (iv) was performed on an Apple MacBook Pro 2019 with an Intel Quad-Core i5 running at 2.4 GHz and 8 GB of memory.

4.3 Results and discussion

The following section presents the results for our four experimental setups, followed by a discussion on the generalizability of our presented learning technique to other case studies.

4.3.1 Minimal states number is given

In this evaluation, we assume that the number of states of a minimal automaton is given. This should demonstrate the basic feasibility of our proposed RNN architecture.

Tables 3 and 4 illustrate the experimental results obtained by applying our approach to learn the automata of Tomita grammars and BLE devices, respectively, from both AAL data and random data. Compared to the same tables from our conference paper [15], we present different numbers here since we repeated the experiments under slightly different conditions for the following reasons:

  • We adapted the random data generation procedure to provide exactly the desired number of unique episodes, instead of just removing the duplicates after the episodes were generated and providing the remaining episodes.

  • We enhanced some procedures to support the new Algorithm 4 which is introduced in this article.

The number of traces contained in the training data is given in the column Size for both AAL data and random data. For the random data, it is interesting to know how many traces from the AAL data were also contained in the random data. This information is shown in the column AAL Data Coverage as the ratio between the number of AAL traces contained in the random data and the overall number of traces in the AAL data.

The column Episode Lengths contains the means and standard deviations of the lengths of the traces in the training data. The column RNN contains (i) the RNN architecture parameters which possibly changed across the experiments (i.e., the activation function af and the number of hidden layers #hl in Table 3 and the regularization factor C in Table 4), and (ii) the number of epochs #e and the time t required by the RNN training to learn the correct automaton.

The values of other RNN architecture parameters, which were the same in all experiments, are mentioned in the table captions. For instance, it turned out that the values 0.001 and 256 for the learning rate and the number of neurons per hidden layer, respectively, worked for all considered case studies.

For the Tomita grammars (Table 3), the value of the regularization factor C was also fixed and equal to 0.001, whereas different activation functions and numbers of hidden layers were used across the different grammars. For the BLE devices (Table 4), the number of the hidden layers and the activation function were also fixed and equal to 1 and ReLU, respectively, whereas different regularization factors were used across different devices.

Table 3 RNN automata learning of Tomita grammars
Table 4 RNN automata learning of BLE devices

The results show that we could find an appropriate architecture to learn the correct automata in all considered cases, meaning that the RNN accuracy was 100% in all cases. This was expected when learning from AAL data, as the AAL data were sufficient for the \(L^*\) algorithm to learn the underlying minimal automaton. More surprising is that we could learn the correct automaton for all examples also from relatively small random datasets with a low coverage of the AAL data. Even more surprising is that for all Tomita grammars, except Tomita 2 and 3, we could significantly shrink the random datasets compared to the AAL dataset and still learn correctly. The datasets became smaller in terms of both the number of traces and the average trace length. Moreover, only a small fraction of the AAL data happened to be included also in the random data. This suggests that the proposed RNN architecture and training may generalize better on sparser datasets than AAL. The good performance on Tomita grammars might be attributed to the small number of automaton transitions and the input/output alphabets that only consist of ‘0’ and ‘1’ symbols.

For the BLE device CYBLE-416045-02, which has much larger input and output alphabets, we could still learn the correct automaton from a random dataset containing fewer traces than the AAL data. The other two BLE devices required larger random datasets due to the higher number of transitions to be covered.

For all case studies, except Tomita 6 and 7, the same RNN architecture worked for both AAL and random data. For Tomita 7, ReLU worked for the AAL data, whereas tanh worked better for the random data. For Tomita 6, two hidden layers worked better for the random data, as opposed to a single hidden layer. Typically, the tuning process of RNN hyperparameter values depends on the dataset, and each case study involves a distinct dataset. When learning from random data, we also attempted to learn from datasets as small as possible. In the case of Tomita 6 and 7, this necessitated different values for some hyperparameters compared to learning from AAL data.

4.3.2 Minimal states number is learned

In the following, we present the results of the evaluation for learning the minimal ground-truth automaton from scratch with Algorithm 4, i.e., without knowing the minimal number of states in advance. Below, we denote an automaton that is isomorphic to the minimal ground-truth automaton as the minimal AAL automaton.

We define an ‘experiment’ as the endeavor to learn a minimal automaton by executing Algorithm 4, with the following components: (i) a training dataset, such as the AAL dataset or randomly generated data, related to a Tomita grammar or a BLE device, (ii) an RNN architecture specified by parameters like the number of hidden layers, number of neurons, activation function, learning rate, and regularization factor (C), and (iii) a training budget that includes the maximum number of runs at each iteration of the fixpoint computation, the maximum number of training epochs, and, in the case of random data, the number of episodes to be randomly generated and the maximum allowed length of a generated episode. We say that an experiment is approved if the value of the variable \(approved_{min}\) returned by Algorithm 4 is true, i.e., the last fixpoint iteration returned an automaton isomorphic to the automaton learned in the previous iteration. In general, we follow the ‘best-effort’ strategy if not otherwise stated.

Table 5 Experiments individual results: minimal RNN automata learning of Tomita grammars with #epochs = 100 (strategy = best-effort)

Our minimal automaton learning procedure is inherently stochastic. There are two sources of stochasticity: (i) RNN-related stochasticity (e.g., RNN parameters initialization, samples ordering during training, etc.), and (ii) random generation of training data. To account for the RNN-related stochasticity, we repeat an experiment with the same input ten times. When learning a minimal automaton from random data, the training data are always newly randomly generated in each of the ten experiments.

Table 6 Experiments individual results: minimal RNN automata learning of Tomita 5 and 7 grammars from AAL data with #epochs = 200 (strategy = best-effort)

Tables 5, 6, 9, and 10 show the individual results of all experiments for the Tomita grammars and the BLE devices, respectively. An entry in these tables reports the following information for the corresponding experiment of a Tomita grammar, resp. BLE device:

  • overall result: (i) ✓ if a minimal automaton was learned, (ii) ✗ if an automaton was learned but it is not minimal, (iii) a marker for an unapproved experiment if the fixpoint has not been reached within the given budget, i.e., \(approved_\textrm{min} = \bot \) in Algorithm 4, (iv) (✓) if an automaton with the minimal number of states was learned but it is not the same as the minimal AAL automaton due to incomplete randomly generated data,

  • fixpoint convergence details: N\(\xrightarrow {\#it}\)n, where N is the upper-bound size of the state vector \(s^{<t>}\) (i.e., the maximal number of states that can be learned) initially estimated via the IOPTA computation, n is either the number of states of the learned automaton or n.a. (not applicable) if no automaton could be learned, and \(\#it\) is the number of learning iterations, which equals the number of iterations of the repeat-loop in Algorithm 4,

  • FSM minimization contribution: (i) ‘min(I:i)’: the maximum reduction of states achieved by the FSM minimization approach, as described in Sect. 2.2, during the learning iterations, where I and i (with I > i) are the numbers of states of the automaton extracted from the trained RNN before and after FSM minimization, respectively, (ii) ‘−’ if I = i, i.e., the FSM minimization could not further reduce the number of states of the extracted automaton at any learning iteration, or (iii) ‘n.a.’ if no automaton could be learned,

  • execution time in the format hh:mm:ss, i.e., the hours, minutes, and seconds taken to run the experiment.

Tables 7, 8, 11, and 12 summarize the statistics averaged over the ten experiments performed for each Tomita grammar and BLE device with AAL and random training data. The meaning of the columns is as follows (a short sketch of the aggregation is given after the list):

  1. “Autom. Learned”: the percentage of all experiments in which an automaton could be learned, i.e., the percentage of approved experiments,

  2. “Learned Autom. is min.”: the percentage of approved experiments in which the learned automaton is minimal,

  3. “Minim. Boost”: the percentage of all experiments in which the automaton extracted from the trained RNN at some fixpoint iteration could be reduced by the FSM minimization, thereby reducing the number of fixpoint iterations,

  4. “Initial State Vector Size”: the mean and standard deviation, over all experiments, of the upper bound estimated via the IOPTA computation, which sets the size of the state vector \(s^{<t>}\) in the RNN architecture (Fig. 6) for the first fixpoint iteration,

  5. “#Iterations”: the mean and standard deviation, over the approved experiments, of the number of iterations required to converge,

  6. “Non-min. Autom. #states”: the mean and standard deviation of the number of states of the learned automaton, over all approved experiments in which a non-minimal automaton was learned,

  7. “Learning Time”: the mean and standard deviation, over all approved experiments, of the execution time, both in the format hh:mm:ss.
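As an illustration of how these columns aggregate per-experiment records, consider the following sketch. The record fields are hypothetical, not the actual data layout; the denominators (all experiments versus approved experiments) follow the definitions above.

```python
# Hypothetical aggregation of per-experiment records into the summary columns;
# assumes at least two approved experiments so that stdev is defined.
from statistics import mean, stdev

def summarize(experiments):
    approved = [e for e in experiments if e["approved"]]
    non_min = [e for e in approved if not e["is_minimal"]]

    def pct(part, whole):
        return 100 * len(part) / len(whole)

    return {
        "Autom. Learned": pct(approved, experiments),
        "Learned Autom. is min.": pct([e for e in approved if e["is_minimal"]], approved),
        "Minim. Boost": pct([e for e in experiments if e["minimization_helped"]], experiments),
        "Initial State Vector Size": (mean(e["upper_bound"] for e in experiments),
                                      stdev(e["upper_bound"] for e in experiments)),
        "#Iterations": (mean(e["iterations"] for e in approved),
                        stdev(e["iterations"] for e in approved)),
        "Non-min. Autom. #states": (mean(e["states"] for e in non_min),
                                    stdev(e["states"] for e in non_min)) if len(non_min) > 1 else None,
        "Learning Time": (mean(e["seconds"] for e in approved),
                          stdev(e["seconds"] for e in approved)),
    }
```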

Table 7 Experiments summary: minimal RNN automata learning of Tomita grammars with #epochs = 100 (strategy = best-effort)
Table 8 Experiments summary: minimal RNN automata learning of Tomita 5 and 7 grammars from AAL data with #epochs = 200 (strategy = best-effort)

Tomita grammars. First, a budget of 100 epochs was employed for all Tomita grammars. As Tables 5 and 7 show, this budget was sufficient (indicated by ✓) to learn the minimal AAL automaton for most Tomita grammars in most experiments, except for Tomita 5 and Tomita 7 with AAL training data. For Tomita 5, we have only one approved experiment, in which it was also possible to learn the minimal automaton. In the remaining nine experiments, marked as unapproved, none of the ten runs at the first learning iteration yielded an automaton perfectly explaining the training data, i.e., training did not converge to 100% accuracy within the given budget. This strongly suggests that the maximum of 100 epochs was too low and stopped the RNN training too early. The situation was similar, though less drastic, for Tomita 7, with four unapproved experiments in which no automaton could be learned. We thus increased the maximum number of epochs to 200 and repeated the ten experiments for Tomita 5 and Tomita 7 with the AAL training data. The results reported in Tables 6 and 8 show significant improvements with the increased budget.

All learned automata have the minimal number of states required to model the corresponding ground-truth system. After increasing the maximum number of training epochs for Tomita 5 and 7, very few unapproved experiments remained overall. Remarkably, in 9 out of 16 Tomita training setups, the minimal AAL automaton could be learned in all conducted experiments. Moreover, the number of unapproved experiments can be reduced further, if necessary, by increasing the budget. Importantly, Algorithm 4 directly indicates, through the \(approved_{min}\) variable, whether an unapproved experiment needs to be addressed by increasing the budget.

It is worth noting that for all Tomita grammars except Tomita 3, learning a minimal automaton was more efficient with random data than with AAL data. A randomly generated dataset smaller than the AAL dataset was sufficient to learn the minimal AAL automaton. This suggests that the AAL data, which is optimal for the \(L^*\) algorithm to learn the minimal automaton, is not necessarily optimal for the RNN.

In most cases, the FSM minimization helped speed up the fixpoint convergence by reducing the number of necessary iterations. In all cases (including those without an FSM minimization contribution), except one experiment for Tomita 1, the minimal AAL automaton could be learned in the minimum number of fixpoint iterations, which is two. This means that in most experiments we could learn a minimal automaton within the first fixpoint iteration using a data-based upper bound on the number of states; the second iteration is only required to approve the previous learning result.
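The FSM minimization step referenced here (and in the ‘min(I:i)’ entries) can be realized with standard partition refinement. The following generic sketch illustrates the idea for a complete, deterministic Mealy machine; it is not necessarily the exact implementation used in our tool chain.

```python
# Generic Moore-style partition refinement for a complete, deterministic
# Mealy machine; illustrative only.

def minimize_mealy(states, inputs, delta, lam):
    """delta[(q, a)] -> successor state, lam[(q, a)] -> output symbol."""
    # Initial partition: states producing identical outputs share a block.
    by_output = {}
    for q in states:
        by_output.setdefault(tuple(lam[(q, a)] for a in inputs), set()).add(q)
    partition = list(by_output.values())

    # Refine until successors of block-equivalent states remain block-equivalent.
    changed = True
    while changed:
        changed = False
        block_of = {q: i for i, block in enumerate(partition) for q in block}
        refined = []
        for block in partition:
            groups = {}
            for q in block:
                key = tuple(block_of[delta[(q, a)]] for a in inputs)
                groups.setdefault(key, set()).add(q)
            refined.extend(groups.values())
            changed |= len(groups) > 1
        partition = refined
    return partition  # each block corresponds to one state of the minimal machine
```

In the ‘min(I:i)’ notation, I corresponds to the number of states of the extracted machine and i to the number of blocks returned by such a refinement.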

Finally, the above results were obtained by employing the best-effort strategy, which was sufficient to learn the minimal AAL automaton for all Tomita grammars.

Table 9 Experiments individual results: minimal RNN automata learning of BLE devices with strategy = best-effort (#epochs = 200)
Table 10 Experiments individual results: minimal RNN automata learning of CC2650 device from AAL data with strategy = exhaustive (#epochs=200)
Table 11 Experiments summary: minimal RNN automata learning of BLE devices with strategy = best-effort (#epochs = 200)
Table 12 Experiments summary: minimal RNN automata learning of CC2650 device from AAL data with strategy = exhaustive (#epochs = 200)

BLE devices. We first employed the best-effort strategy for all BLE devices in all experiments. As Tables 9 and 11 show, this was sufficient to learn the minimal automaton in most cases, except for the CC2650 device with AAL data. Remarkably, only two unapproved experiments occurred overall. For the nRF52832 device in particular, all test setups delivered the minimal AAL automaton.

For the CYBLE-416045-02 and CC2650 devices with random data, there were a few experiments in which an automaton with the same number of states as the ground-truth automaton, but with different transitions, was learned. These experiments are indicated by (✓). This suggests that the randomly generated training data did not fully cover the ground-truth automaton. Enlarging the dataset with more randomly generated episodes could resolve this issue.

For the CC2650 device, a non-minimal automaton was sometimes learned, indicated by ✗. Note that the learned automaton still conforms to the provided dataset. Learning a non-minimal automaton was experienced more frequently with AAL data. We explain this phenomenon by the conceptual difference between the AAL algorithm used and our learning technique. The \(L^*\) algorithm incrementally increases the automaton size and stops when no further counterexample to the conformance between the learned automaton and the SUL can be found. In contrast, our algorithm incrementally decreases the maximum allowed number of states of the automaton. Let S be the AAL dataset, \(M_A\) the respective automaton learned by \(L^*\), and \(M_R\) the automaton learned by our RNN-based technique; for the CC2650 device we then observe \(S \subset L(M_A)\) but also \(S \subset L(M_R)\). Since we assume that \(M_A\) correctly represents the SUL, our dataset S misses a trace that shows that \(L(M_A) \ne L(M_R)\). This would be an issue in a generic application scenario where the ground-truth automaton is unknown, since we would learn a non-minimal automaton without knowing that it is not minimal. In practice, we can circumvent this problem by testing conformance between the learned model and the SUL, following a conformance testing approach as in active learning. Moreover, employing the more expensive exhaustive strategy significantly improves the probability of fixing the non-minimality issue, as a comparison of Tables 9 and 10 shows. In fact, with the exhaustive strategy, in only 1 out of 8 approved experiments was the learned automaton not minimal. It is worth noting, however, that whenever a non-minimal automaton was learned, its size did not differ much from the minimal size, e.g., 6 or 7 states of the non-minimal automaton versus 5 states of the minimal automaton.
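Checking whether the dataset misses such a distinguishing trace amounts to an equivalence check between the two Mealy machines. One standard way to realize it is a breadth-first search over the product machine, as in the following sketch; the tuple-based machine encoding is our own illustration, assuming complete transition and output functions.

```python
# BFS over the product of two deterministic Mealy machines, searching for an
# input sequence on which their outputs differ; a generic sketch.
from collections import deque

def find_distinguishing_sequence(m1, m2, inputs):
    """Each machine is (initial, delta, lam) with complete delta/lam dicts."""
    (q1, d1, l1), (q2, d2, l2) = m1, m2
    queue, seen = deque([(q1, q2, ())]), {(q1, q2)}
    while queue:
        s1, s2, prefix = queue.popleft()
        for a in inputs:
            if l1[(s1, a)] != l2[(s2, a)]:
                return prefix + (a,)  # outputs diverge: languages differ
            successor = (d1[(s1, a)], d2[(s2, a)])
            if successor not in seen:
                seen.add(successor)
                queue.append(successor + (prefix + (a,),))
    return None  # no difference reachable: the machines are equivalent
```

Applied to \(M_A\) and \(M_R\), a returned input sequence is exactly a trace missing from S; a result of None would mean \(L(M_A) = L(M_R)\).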

Compared to the Tomita grammars, the upper bounds estimated with the IOPTA computation are much larger. This is due to the larger training datasets for the BLE devices, especially the CC2650 and nRF52832 devices. Nevertheless, the fixpoint convergence is still reached in a low number of iterations, mostly 2 or 3. This is because our RNN architecture, thanks to the design of the regularization, is capable of learning a number of automaton states close to the minimal one even if the size of the state vector \(s^{<t>}\) would allow learning many more states. However, even though such large initial sizes of the state vector \(s^{<t>}\) barely affect the functional capability of learning the minimal automaton, they slow down the training procedure since the RNN size increases considerably. Finding a tighter approximation than the IOPTA computation for the initial size of the state vector \(s^{<t>}\), or simply making a good guess and trying a much lower initial size, could significantly speed up learning the minimal automaton. For example, setting the initial size to 100 would represent a significant reduction (of up to two orders of magnitude) compared to the IOPTA computation, yet remains reasonable as it is much higher than the actual minimal number of states. Specifically, for the nRF52832 device, we found that the learning time of the minimal automaton could be reduced by approximately 21%, from 00:05:22 to 00:04:15, with the AAL data and by around 94%, from 02:59:50 to only 00:11:03, with the random data. However, in this paper we adhere to the IOPTA computation instead of defining ad hoc fixed upper bounds, since the IOPTA provides a theoretical guarantee for the upper bound of the minimal number of states.
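For reference, the IOPTA-based upper bound can be obtained by inserting all traces into a prefix tree and counting its nodes; the following is a minimal sketch of this idea, with the trace representation chosen for illustration.

```python
# Sketch of the IOPTA-based upper bound: every trace is inserted into a prefix
# tree whose node count bounds the number of states needed to fit the data.

def iopta_upper_bound(traces):
    """traces: iterable of episodes, each a sequence of (input, output) pairs."""
    root, count = {}, 1  # the root of the prefix tree counts as one state
    for episode in traces:
        node = root
        for step in episode:  # step = (input, output)
            if step not in node:
                node[step] = {}
                count += 1  # every fresh tree node is one more potential state
            node = node[step]
    return count
```

Since the tree grows with the dataset, the large BLE datasets directly explain the large initial state vector sizes discussed above.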

In summary, whether Algorithm 4 learns a minimal automaton equivalent to the ground truth depends on the quality of the training data and on the complexity of the ground-truth automaton in terms of the number of states and transitions and the input/output alphabet sizes. Unapproved experiments are not critical since they are directly signaled by the algorithm and can be mitigated by increasing the epoch budget. If a stronger guarantee of automaton minimality is required, the more expensive exhaustive strategy should be used.

Table 13 Change of RNN hyperparameters for Tomita 6 and Tomita 7
Table 14 Change of RNN hyperparameters for Tomita 6 and Tomita 7
Table 15 Change of RNN hyperparameters for Tomita 6 and Tomita 7

4.3.3 Effects of RNN hyperparameter changes

In the following, we illustrate how varying hyperparameter values can influence the results of our RNN-based learning technique.

The hyperparameter values given in Tables 3 and 4 were determined through a tuning process that we continued until we were satisfied with the results. When dealing with AAL data, we explored different hyperparameter values until we could learn the minimal automaton in a reasonable timeframe. For random data, we additionally varied the size of the dataset, prioritizing learning from smaller datasets. The hyperparameter tuning process was repeated for each dataset size, starting from a size of 10 and gradually increasing it until the automaton could be learned successfully. This explains why we obtained different hyperparameter values for random data than for AAL data in Table 3 for the Tomita 6 and Tomita 7 grammars. To discuss the influence of the hyperparameters, we examine how the learning results change when altering the number of hidden layers and the activation function for Tomita 6 and Tomita 7, respectively. Tables 13, 14, and 15 compare the results of the new RNN architecture with those of the initial RNN architecture, using the same seeds to ensure identical starting conditions for the non-deterministic parts of training.

Tomita 6. In Table 3 for Tomita 6 with random data, we could successfully learn the minimal automaton from a random dataset containing as few as 20 traces when increasing the number of hidden layers from 1 to 2. In contrast, with a single hidden layer, as used for the AAL data, a random dataset of 50 traces would have been needed to learn the minimal automaton. In the following, we therefore check whether learning with two hidden layers also works for the AAL data.

As can be seen in Table 13, when the minimal number of states is given, we could still successfully learn the minimal AAL automaton, but the execution time almost doubles due to the additional hidden layer, while the number of training epochs remains almost constant. When the minimal number of states must also be learned, the learning time increases significantly for most experiments (see Table 15), especially for two experiments where convergence could not be reached (see Table 14). However, the RNN's standalone performance improved: FSM minimization helped in only 60% of the experiments, as opposed to 100% with the initial RNN architecture. This improvement can be attributed to the increased capacity of the RNN due to the additional hidden layer. Although there were two experiments in which no convergence was reached at the first iteration, the overall convergence success rate of 80% remained satisfactory, and there was no need to repeat the experiments with an increased budget.

Tomita 7. In Table 3 for Tomita 7 with random data and the minimal number of states given, we successfully learned the minimal automaton from a random dataset containing as few as 50 traces when switching from the ReLU activation function, which worked first with the AAL data, to tanh. Attempting to learn the automaton from the AAL data with the tanh activation function did not complete within a set time budget of 10 min, an order of magnitude more than the learning time achieved with ReLU. In the following, we analyze the effects of using the ReLU activation function, as with the AAL data, instead of tanh when learning the minimal automaton from random data. Table 13 reveals an increase in the required random dataset size from 50 to 80 traces, which, in turn, explains the increased AAL data coverage and learning time. When the minimal number of states must also be learned (see Tables 14 and 15), the convergence rate improves (90% instead of 80%, see Table 15), as does the RNN's performance relative to external FSM minimization (a minimization boost of 0% instead of 40%), both attributable to the larger training dataset. The larger training dataset is also responsible for the increased average learning time.

4.3.4 Comparison to a classic learning algorithm

We compared our RNN-based learning technique with a conventional passive learning algorithm from the literature. We chose a variant of the Regular Positive Negative Inference (RPNI) algorithm [4, 18] for learning Mealy machines. RPNI is a passive learning algorithm supported by many modern learning libraries such as LearnLib [7] and AALpy [6]. The RPNI variant starts by building an IOPTA from the given dataset, similar to our Algorithm 4. Then the algorithm merges states until no more merges are possible; two states are merged if their future input/output behavior is equivalent with respect to the dataset. Originally, RPNI requires positive and negative episodes, where positive episodes are included in the language to be learned and negative ones are not. Negative traces are not required for learning Mealy machines, as states can be distinguished by their different output behavior for the same input sequences. For consistency with our active-learning-based data generation, we use the RPNI implementation of AALpy for the performed comparison. Note that the chosen implementation merges states deterministically.
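A hedged usage sketch of this comparison setup is shown below. run_RPNI is AALpy's passive learning entry point; the exact shape of the training data for Mealy machines is our assumption and should be checked against AALpy's documentation.

```python
from aalpy.learning_algs import run_RPNI

# Each sample pairs an input sequence with its output sequence; this layout is
# our assumption for automaton_type='mealy' and should be verified against
# AALpy's documentation.
data = [
    (('a', 'b', 'b'), ('0', '1', '0')),
    (('b', 'a'), ('1', '0')),
]

learned_model = run_RPNI(data, automaton_type='mealy')  # deterministic merging
print(learned_model)  # textual dump of the learned Mealy machine
```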

Table 16 RPNI automata learning given the same random datasets used in the RNN-based experiments presented in Tables 5 and 9

Table 16 presents the results of running RPNI on the same randomly generated datasets as used for the RNN-based experiments in Tables 5 and 9. The numbers of episodes correspond to the numbers given in Tables 3 and 4. For comparison, we measure the conformance between the ground-truth automaton and the learned automaton by checking bisimilarity. The conformance metric gives the average percentage of common edges over the union of the edges of both automata. In addition, we report the average number of states of the learned automata, the percentage of automata that have the same number of states as the minimal ground-truth automaton, and the required execution time of the learning algorithm.
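One plausible realization of this edge-based conformance metric is sketched below. It aligns the states of both machines via a canonical breadth-first numbering before intersecting the edge sets; this alignment is our simplifying assumption and may differ from the exact bisimilarity-based computation.

```python
# Sketch of the edge-overlap conformance metric: canonically number the states
# of each machine by BFS from the initial state, then take the Jaccard index
# of the resulting edge sets (as a percentage).
from collections import deque

def canonical_edges(initial, delta, lam, inputs):
    ids, queue, edges = {initial: 0}, deque([initial]), set()
    while queue:
        q = queue.popleft()
        for a in sorted(inputs):
            nxt = delta[(q, a)]
            if nxt not in ids:
                ids[nxt] = len(ids)
                queue.append(nxt)
            edges.add((ids[q], a, lam[(q, a)], ids[nxt]))
    return edges

def conformance(m1, m2, inputs):
    """Each machine is (initial, delta, lam); returns the % of common edges."""
    e1, e2 = (canonical_edges(*m, inputs) for m in (m1, m2))
    return 100 * len(e1 & e2) / len(e1 | e2)
```

For identical machines the metric is 100%; every edge present in only one of the machines lowers it toward 0%.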

Remark

Owing to space constraints, the conformance metric values for the RNN-based experiments were not included in the tables presented in the earlier sections. To facilitate a more comprehensive comparison with the RPNI results in Table 16, Table 17 provides the average and standard deviation of the conformance values per use case. These values correspond to the experiments conducted with the random data as outlined in Tables 5 and 9. Note that experiments in which the convergence point was not reached, i.e., unapproved experiments, have been excluded.

Table 17 Conformance results from the RNN-based experiments with the random data presented in Tables 5 and 9

The RPNI results indicate differences between the Tomita and the BLE case studies. For the Tomita case study, RPNI achieves quite high conformance scores, with only a few exceptions where a slightly larger automaton is learned. This indicates that the generated random datasets sufficiently cover the behavior of the underlying Tomita grammars. The results for the BLE case study differ: the automata learned with RPNI achieve significantly lower conformance with the ground-truth automata, and RPNI failed to learn the minimal automaton in the majority of experiments. These results emphasize that randomly generated datasets might not be complete, meaning that not every input and output is defined for every state. For the BLE case study, sparsity in the dataset is expected due to the larger input and output alphabets. This shows that our RNN-based technique generalizes well on incomplete datasets and learns more conforming automata than RPNI. Regarding execution time, we observe that state merging requires less time than training an RNN.

For the case studies considered in this paper, a comparison of Table 16 with Tables 7 and 11 indicates that our RNN-based learning approach is preferable to a traditional passive automata learning algorithm like RPNI when utilizing random data or data collected during the regular operation of the SUL. If reliable online interaction with the SUL is possible, it is advisable to opt for a classical active automata learning algorithm such as \(L^*\) to effectively learn the minimal automaton of the SUL.

4.3.5 Generalizability

We evaluated the feasibility of learning minimal FSMs of relatively small size from samples of Tomita regular languages and of BLE devices. It remains to be shown whether our results generalize to more complex languages that require larger FSMs. However, our main goal was to propose an architecture that can accurately capture discrete dynamics. The languages from our case studies possess various features, such as modular arithmetic in Tomita 5. Hence, we argue that the architecture generalizes well to different languages of comparable discrete complexity. While interesting, scaling to large discrete systems is not our foremost goal. Since RNNs are generally well suited to tasks like time-series forecasting, we rather plan to extend RNNs with our architecture to additionally model continuous dynamics. Thus, we aim to pave the way toward end-to-end learning of hybrid system models.

5 Related work

In this research, we address the challenge of learning an FSM from a given sample by employing constrained training of an RNN. The literature has presented various techniques for extracting an FSM from sample data, including state-merging-based approaches [4, 18, 32, 33], search-based techniques [34,35,36,37], and methods relying on SAT [9,10,11] and SMT [12, 13] solving. Similar to state-merging methods, we construct an IOPTA, with the distinction that we do not merge its states. The IOPTA solely establishes an upper bound on the maximum number of states of the ultimately learned FSM. Our findings in Sect. 4 demonstrate that our RNN-based method can learn more accurate FSMs for systems with more than two inputs and outputs compared to a well-established state-merging technique [18, 32].

Early work on the relationship between finite automata and neural networks examined the capacity of neural networks to simulate the behavior outlined by a finite automaton. Pioneering this investigation, Kleene [38] was among the first to demonstrate the suitability of neural networks for such simulations. Subsequently, Minsky [39] provided a comprehensive construction of neural networks capable of simulating finite automata. In contrast, we do not simulate a known automaton, but rather we learn an FSM from a sample of a regular language.

The relationship between neural networks and automata has also been exploited to explain the behavior of neural networks, by extracting automata from trained neural networks. The process of extracting an automaton from a neural network is commonly referred to as knowledge distillation [40]. The literature comprises various techniques for knowledge distillation: Giles et al. [41] demonstrated the extraction of deterministic finite automata (DFAs) from a particular type of RNNs trained on regular languages, a method later refined by Omlin and Giles [42] to be applicable to a broader range of RNN architectures. In a more general context, Wang et al. [43] conducted a benchmark study on different RNN architectures for DFA extraction. Furthermore, Wang et al. [44] presented an empirical evaluation, validating the reliability of the approach introduced by Giles et al. [41]. The basis for their approach is that hidden states of RNNs form clusters; thus, automata states can be identified by determining such clusters. This property was recently also used to learn DFAs [26]. Tiňo and Šajda [45] used self-organizing maps to identify clusters for modeling Mealy automata. Michalenko et al. [46] empirically analyzed this property and found that there is a correspondence between hidden-state clusters and automata states, but some clusters may overlap, i.e., some states may be indistinguishable. Hong et al. [47] and Dong et al. [31] utilized clustering to find an adequate state abstraction and subsequently apply a state-merging algorithm on the abstracted traces to learn DFAs and Markov chains, respectively. The use of these techniques relies on the assumption that the nodes of neural networks form distinct clusters.

In contrast to relying on clustering, which may not be perfect, we enforce a clustering of the hidden states through regularization. Closest to our work in this regard is the work by Oliva and Lago-Fernández [27]. They enforce neurons with sigmoid activation to operate in a binary regime, thus leading to very dense clusters, by introducing Gaussian noise prior to applying the activation function during training. The identified clusters correspond to states of the FSM, where the number of clusters is not necessarily minimal. Similar to our approach, they apply minimization algorithms to the extracted automaton. However, a key distinction from our technique lies in the fact that the definition of state transitions relies on the inference of clusters in their approach, whereas our proposed RNN architecture directly predicts the next states.

An additional method for knowledge distillation of neural networks involves treating the trained neural network as an oracle that can be queried. Using an active learning approach, the learning algorithm constructs a behavioral model based on the responses obtained from these queries. Several approaches have been recently proposed based on or related to the \(L^*\) algorithm [5]. Weiss et al. proposed automata-learning-based approaches to extract DFAs [26], weighted automata [48], and subsets of context-free grammars from RNNs [49]. Mayr and Yovine [50] applied the \(L^*\) algorithm to extract automata from RNNs, where they provide probabilistic guarantees. Khmelnitsky et al. [51] propose a property-directed verification approach for RNNs. They use the \(L^*\) algorithm to learn automata models from RNNs and analyze these models through model checking. Muškardin et al. [28] examine the effect of different equivalence-query implementations in \(L^*\)-based learning of models from RNNs. Barbot et al. [52] use an \(A^*\)-based technique to extract push-down automata from RNNs simulating context-free grammars. In addition to \(L^*\)-related approaches, there are also query-based techniques utilizing Hankel matrices [53]: Both Eyraud and Ayache [54] and Lacroce et al. [55] create weighted automata by populating a Hankel matrix through queries to an RNN. Motivated by the recent popularity of knowledge distillation, the TAYSIR competition [56] has been initiated, with the aim of creating models that provide simpler representations of trained RNNs and transformers. Muškardin et al. [57] emerged as the winners of the competition, employing a learning-based testing approach following their earlier work [28] and utilizing the automata learning library AALpy [6]. It is important to note that these active techniques focus on extracting an automaton from a trained neural network, whereas our approach involves a constrained training technique for RNNs to extract automata from given samples.

Another approach to extract an FSM from a trained neural network is to additionally train an autoencoder that encodes a neural network as a finite state representation. Koul et al. [58] introduce quantization through training of quantized bottleneck networks into RNNs that encode policies of autonomous agents. This allows them to extract FSMs in order to understand the memory usage of recurrent policies. Carr et al. [59] use quantized bottleneck networks to extract finite-state controllers from recurrent policies to enable formal verification.

6 Conclusion

In this work, we presented a new machine learning technique for learning minimal finite-state models in the form of Mealy machines. Our automata learning approach builds upon a specialized RNN architecture together with a constrained training method in order to construct a fixed-size Mealy machine from given training data. Starting from an upper bound on the number of states, the approach iteratively creates models of decreasing size. This iterative process enables learning of minimal models. In common with classical passive automata learning methods, like RPNI [4] and ALERGIA [32], we start from a tree-shaped representation of the training data. However, instead of explicit state merging, we employ RNN training to search for a smaller representation of the data.

We evaluated our method on example grammars from the literature as well as on a Bluetooth protocol implementation. In almost all cases, we were able to learn a minimal automaton correctly representing the ground truth. Where this was not the case, it was due to missing training data; nevertheless, the learned Mealy machine was correct with respect to the training data. A clear advantage compared to our previous work is that the user does not need to know the number of states in advance.

We see the encouraging results as a step toward learning more complex models comprising discrete and continuous behavior, as found in many control applications. Control applications commonly have only a few modes (discrete states) but may possess complex continuous behavior. For this reason, we focus on small automata in this work. As a next step, we see the integration of continuous behavior into our models as the most promising avenue for future work. That is, we plan to learn the discrete behavior with the regularized training described in this article while learning continuous behavior through conventional RNN training. Having finite-state models of hybrid systems will especially help the explainability and interpretability of decisions of hybrid system controllers. We leave these investigations for future work. We will also apply our approach to case studies with larger numbers of states. However, for such cases, other automata learning techniques that focus on discrete behavior may be more suitable.

Finally, we dare to express the hope that this work contributes to bridging the gap between the research communities in machine learning and automata learning, ultimately leading to more trustworthy AI systems.