Probabilistic black-box reachability checking (extended version)

Model checking has a long-standing tradition in software verification. Given a system design, it checks whether desired properties are satisfied. Unlike testing, it cannot be applied in a black-box setting. To overcome this limitation, Peled et al. introduced black-box checking, a combination of testing, model inference and model checking. The technique requires systems to be fully deterministic. For stochastic systems, statistical techniques are available. However, they cannot be applied to systems with non-deterministic choices. We present a black-box checking technique for stochastic systems that allows both non-deterministic and probabilistic behaviour. It involves model inference, testing and probabilistic model-checking. Here, we consider reachability checking, i.e., we infer near-optimal input-selection strategies for bounded reachability.


Introduction
Model checking has a long-standing tradition in software verification. Given a system design, model-checking techniques determine whether requirements stated as formal properties are satisfied. These techniques and other forms of model-based verification fall short if no design is available. Model learning provides a solution to this issue. It establishes the basis for model-based verification by automatically learning automata models of black-box systems from observed data. Data used as basis for learning is usually given in the form of system traces, that is, sequences of system events, which may be partitioned into input and output events. Note that model learning is also often referred to as model inference, thus we use both terms interchangeably.
There are two main forms of model learning: passive learning and active learning. Passive learning learns from preexisting data such as system logs, while active learning actively queries the system that is examined to gain relevant information. This can for instance be done via testing. Noteworthy early examples of passive learning techniques are RPNI for regular languages [27,41] and Alergia [11] for stochastic regular languages, which learn deterministic finite automata (DFAs) and their stochastic counterparts, respectively. Both RPNI and Alergia apply a principle called state-merging. More recent work based on this principle extends the applicability of passive model learning to timed systems [52], Moore machines [24] and to stochastic systems involving non-deterministic choices [35,36], which we use in this article. All these approaches have in common that the models they learn depend on given sampled training data.
Martin Tappler (martin.tappler@ist.tugraz.at), Institute of Software Technology, Graz University of Technology, Graz, Austria
In contrast to this, active learning approaches rely on the possibility to query relevant information. Angluin formalised this by introducing the minimally adequate teacher framework in her seminal work on the L* algorithm [4]. This framework assumes the existence of a teacher that is able to answer two types of queries: membership queries and equivalence queries. When a model of a software system is learned, membership queries basically check whether a given trace can be observed and equivalence queries check whether a hypothesised system model is equivalent to the system under investigation. In practice, both queries are usually implemented via testing. Since the introduction of L*, it has been adapted and extended to various types of systems like Mealy machines [37,46], timed systems [25] and non-deterministic systems [28,53]. There are also L*-based learning approaches applicable to probabilistic system models [9,19], but they place strong assumptions on the information that can be queried. These approaches are therefore unsuitable for a testing scenario, which allows interaction with a black-box system only via testing.
In this paper, we consider such a testing scenario, in which we know the interface of a black-box system and we can gain information by testing the system. Furthermore, we assume that inputs to the system can be freely chosen and that reactions are stochastic. This makes Markov decision processes (MDPs) a well-suited choice of model type. MDPs allow for non-deterministic choices of inputs, while state transitions are stochastic, whereby outputs are produced depending on the entered state. Given such a system, we aim at generating testing strategies that produce desired outputs with high probability. For learning, we rely on an adaptation of Alergia [11] called IOAlergia [35,36], which learns MDPs. While this learning technique is passive in general, our technique is active, as we generate new data for learning by testing. In an iterative approach, we steer the data generation based on learned models towards desired outputs to explore relevant parts of the system more thoroughly. That way, we aim at iteratively improving the accuracy of learning with respect to these outputs. This is in contrast to the application of IOAlergia in an active setting by Chen and Nielsen [13], as they aimed at actively improving the overall accuracy of learned models.
Model learning enables various forms of verification for black-box systems, such as differential equivalence testing on model-level [5,48,49], model-checking [20,21] and model-based testing with learned models [1]. A particularly interesting technique combining model learning, model checking and testing is black-box checking introduced by Peled et al. [42]. This technique learns models of black-box systems in the form of DFAs on-the-fly and iteratively via L*. Whenever a hypothesis automaton model is created, the hypothesis is model checked, which may reveal a fault in the system or show that learning was incomplete and needs to be continued. If model checking does not reveal a fault, equivalence between the hypothesis and the black-box system is checked via testing. In case non-equivalence is detected, the learned hypothesis is extended and learning continues.
Inspired by black-box checking, we propose an approach to analyse reactive systems exhibiting stochastic behaviour in a black-box setting. We also follow a learning-based approach involving repeated testing and probabilistic model-checking. Instead of targeting general properties, e.g., formulated in probabilistic temporal logics, we check reachability as a first step. Since we follow a simulation-based approach, we check bounded reachability. Rather than learning DFAs, we assume that systems can be modelled by MDPs. Hence, we consider systems controllable by inputs, chosen by an environment or a tester. As such, these systems involve non-determinism resulting from the choice of inputs and stochastic behaviour reflected in state transitions and outputs. Given such a system, our goal in bounded reachability checking is to find an input-selection strategy, i.e. a resolution of non-determinism, which maximises the probability of reaching a certain property within a bounded number of steps. Properties can, for example, be the observation of desired outputs. A possible application scenario for our technique is stress testing of systems with stochastic failures. We could generate a testing strategy that provokes such failures.
The approach we follow is shown in Fig. 1. First, we sample system traces randomly. Then, we infer an MDP from these traces via the state-merging-based method described by Mao et al. [35,36], which as noted above is called IOAlergia. Once we inferred a hypothesis model $M_{h_1}$, we use the Prism model checker [29] for a reachability analysis to find the maximal probability of reaching a state satisfying a property $\psi$. Prism computes a probability $p$ and a strategy $s_1$ (also called adversary or scheduler) to reach $\psi$ with $p$. Since IOAlergia infers models from system traces, the quality of the model $M_{h_1}$ depends on these traces. If $\psi$ is not adequately covered, $s_1$ inferred from $M_{h_1}$ may perform poorly and rarely reach $\psi$. To account for that, we follow an incremental process. After initial random sampling, we iteratively infer models $M_{h_i}$ from which we infer strategies $s_i$. To sample new traces for $M_{h_{i+1}}$ we select inputs randomly and based on $s_i$, that is, we use the strategy $s_i$ for directed testing. Selecting inputs with $s_i$ ensures that paths relevant to $\psi$ will be explored more thoroughly. This process is repeated until either a maximal number of rounds $n$ has been executed, or a heuristic detects that the search has converged to a scheduler.
We mainly use Prism to generate strategies, but ignore the probabilities computed in the reachability analysis. Since the computations are based on possibly inaccurate learned models, the probabilities may significantly differ from the true probabilities. Strategies, however, may serve as testing strategies regardless of the accuracy of the learned models. In fact, we evaluate the final strategy generated in the process described above via directed testing of the system under test (SUT). Since the behaviour under a strategy is purely probabilistic, this is a form of Monte Carlo simulation, which is commonly used in statistical model-checking (SMC) [31]. The evaluation provides an estimation of the probability of reaching $\psi$ with the actual SUT under strategy $s_l$, where $l$ is the last round that has been executed. By directly interacting with the SUT during evaluation, the computed estimation is an approximate lower bound for the optimal probability. In contrast to this, the reachability probabilities computed by Prism based on the learned model do not enjoy this property.
In summary, we combine techniques from various disciplines to optimise the probability of observing desired outputs within a bounded number of steps.

- Learning: We rely on IOAlergia for learning MDPs. This algorithm has been developed with verification in mind and evaluated in a model-checking context [35,36].
- Probabilistic model-checking: We use Prism [29], a state-of-the-art probabilistic model checker, to generate strategies for bounded reachability based on learned models.
- Testing: Directed sampling guided by a strategy is a form of model-based testing with learned models. The sampling algorithm was developed for the presented technique.
- Statistical model-checking: We evaluate the final strategy on the SUT. As the SUT is a black box, we cannot apply probabilistic model-checking and instead perform a Monte Carlo simulation to estimate reachability probabilities, like in SMC [31].
Parts of this paper have already been published in the proceedings of the 17th International Conference on Runtime Verification [3]. Additional content presented in the current paper covers the heuristic check for detecting convergence, a more thorough evaluation including two new case studies and several further improvements throughout the paper.
The rest of this paper is structured as follows. In Sect. 2, we will discuss related work. Section 3 introduces preliminaries used in Sect. 4 which discusses the proposed approach. We present evaluation results in Sect. 5. Finally, we provide an outlook on future work and conclude in Sect. 6.

Related work
As discussed before, black-box checking [42] is closely related. In contrast to our technique, it considers non-stochastic systems, but more general properties. Various follow-up work demonstrates the potential of learning-based verification. Extensions, e.g., take existing models into account [26], focus on the composition of black-box and white-box components [17], or check security properties [47].
Mao et al. [34-36] also inferred probabilistic models with the purpose of model checking. In fact, we apply the model-inference technique for MDPs described by them. Wang et al. [54] apply a variant of Alergia as well and take properties into account during model inference with the goal of probabilistic model-checking. They apply automated property-specific abstraction/refinement to decrease the model-checking runtime. Nouri et al. [39] also combine stochastic learning and abstraction with respect to some property. Their goal is to improve the runtime of SMC. Notably, their approach could also be applied to black-box systems, but does not consider controllability via inputs. Further work on SMC of black-box systems can be found in [45,55].
Although we did not adapt IOAlergia, a passive model-inference technique, we apply it in an active setting. Chen and Nielsen [13] describe active learning of MDPs based on IOAlergia. However, they do not aim at model checking, but try to reduce the required number of samples by directing sampling towards uncertainties.
We try to find optimal schedulers for MDPs. This problem has been solved in other simulation-based verification approaches as well, like in SMC. A lightweight approach for finding schedulers in SMC is described in [14,33]. By representing schedulers efficiently, they are able to consider history-dependent schedulers and through "smart sampling" they accomplish finding near-optimal schedulers with low simulation budget. Brázdil et al. [10] presented an approach to unbounded reachability analysis via SMC. The technique is based on delayed Q-learning, a form of reinforcement learning, requiring only limited knowledge of the system (but more than our technique). Another approach using reinforcement learning for strategy inference for reachability objectives has been presented by David et al. [15]. They minimise expected cost while respecting worst-case time bounds.
Learning-based synthesis of control strategies for MDPs has also been studied by Fu and Topcu [23]. They obtain control strategies which are approximately optimal with respect to linear temporal logic (LTL) specifications. They consider transition probabilities to be initially unknown, but in contrast to our setting they assume the MDP structure to be known.

Preliminaries
We introduce background material following [22,36], but consider only finite traces, finite paths, and bounded reachability, as we use a simulation-based approach. The restriction to bounded properties is also commonly found in SMC [31], which is also simulation-based and from which we apply concepts. Moreover, SMC of unbounded properties is especially challenging in a black-box setting [32].
Basics. Let $\Sigma_{in}$ and $\Sigma_{out}$ be sets of input and output symbols. An input/output string $s$ is an alternating sequence of inputs and outputs, starting with an output, i.e. $s \in \Sigma_{out} \times (\Sigma_{in} \times \Sigma_{out})^*$. We denote by $|s|$ the number of input symbols in $s$ and refer to it also as string/trace length. Given a set $S$, we denote by $Dist(S)$ the set of probability distributions over $S$, i.e. for all $\mu \in Dist(S)$ we have $\mu : S \to [0, 1]$ such that $\sum_{s \in S} \mu(s) = 1$. We denote the indicator function by $\mathbf{1}_A$, which returns 1 for $e \in A$ and 0 otherwise.
In Sect. 4, we apply two pseudo-random functions coinFlip and randSel. These require an initialisation operation which takes a seed-value for a pseudo-random number generator and which returns implementations of both functions. The function coinFlip implements a biased coin flip and is defined for $p \in [0, 1]$ by $P(coinFlip(p) = \top) = p$ and $P(coinFlip(p) = \bot) = 1 - p$. The function randSel takes a set as input and returns a single element of the set, whereby the element is chosen according to a uniform distribution, i.e. $\forall e \in S : P(randSel(S) = e) = \frac{1}{|S|}$.
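The two pseudo-random functions can be sketched as follows. This is a minimal illustration, not the implementation used in the evaluation: the name `init_random` and the decision to sort the set before selecting (so that results are reproducible for a fixed seed) are assumptions made for this example.

```python
import random
from typing import Callable, Iterable, Tuple


def init_random(seed: int) -> Tuple[Callable[[float], bool], Callable[[Iterable], object]]:
    """Hypothetical initialisation: returns (coin_flip, rand_sel) sharing one seeded generator."""
    rng = random.Random(seed)

    def coin_flip(p: float) -> bool:
        # True with probability p, False with probability 1 - p.
        if not 0.0 <= p <= 1.0:
            raise ValueError("p must lie in [0, 1]")
        return rng.random() < p

    def rand_sel(items: Iterable):
        # Uniform selection; sorting removes any dependence on set iteration order
        # (assumes the elements are mutually comparable).
        pool = sorted(items)
        if not pool:
            raise ValueError("cannot select from an empty set")
        return rng.choice(pool)

    return coin_flip, rand_sel
```

Because `random.Random.random()` returns values in [0, 1), `coin_flip(1.0)` always succeeds and `coin_flip(0.0)` never does, matching the boundary cases of the definition.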

Markov decision processes
MDPs allow modelling reactive systems with probabilistic responses. An MDP starts in an initial state. During execution, the environment may choose and execute inputs non-deterministically upon which the system reacts according to its current state and its probabilistic transition function. For that, the system changes its state and produces an output.

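Definition 1 (Markov decision process (MDP)) A Markov decision process (MDP) is a tuple $M = \langle Q, \Sigma_{in}, \Sigma_{out}, q_0, \delta, L \rangle$ where
- $Q$ is a finite set of states,
- $\Sigma_{in}$ and $\Sigma_{out}$ are finite sets of input and output symbols respectively,
- $q_0 \in Q$ is the initial state,
- $\delta : Q \times \Sigma_{in} \to Dist(Q)$ is the probabilistic transition function, and
- $L : Q \to \Sigma_{out}$ is the labelling function.
The transition function $\delta$ must be defined for all $q \in Q$ and $i \in \Sigma_{in}$. We consider only deterministic MDPs, i.e. $\forall q \in Q, \forall i \in \Sigma_{in} : \delta(q, i)(q') > 0 \wedge \delta(q, i)(q'') > 0 \rightarrow q' = q'' \vee L(q') \neq L(q'')$. Non-determinism thus results only from the non-deterministic choice of inputs by the environment.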
The above definition requires MDPs to be input-enabled, that is, they must not block or reject inputs. Since we assume SUTs to be MDPs, this allows us to execute any input at any point in time.
We generally set $\Sigma_{out} = \mathcal{P}(AP)$ where $AP$ is a set of relevant propositions and $L(q)$ gives the propositions that hold in state $q$. A finite path $\rho$ through an MDP is an alternating sequence of states and inputs, i.e. $\rho = q_0 i_1 q_1 \cdots i_{n-1} q_{n-1} i_n q_n \in Q \times (\Sigma_{in} \times Q)^*$. The set of all paths of an MDP $M$ is denoted by $Path_M$. A path $\rho$ corresponds to a trace $L(\rho) = t$, i.e. an input/output string, with $t = o_0 i_1 o_1 \cdots i_{n-1} o_{n-1} i_n o_n$ and $L(q_i) = o_i$. To reason about probabilities of traces, we need a way to resolve non-determinism. To accomplish this, we introduce schedulers, which are often also referred to as adversaries or strategies [36]. Schedulers basically choose the next input action (probabilistically) given a history of visited states, i.e. a path.

Definition 2 (Scheduler) Given an MDP $M = \langle Q, \Sigma_{in}, \Sigma_{out}, q_0, \delta, L \rangle$, a scheduler for $M$ is a function $s : Path_M \to Dist(\Sigma_{in})$.
To define a probability distribution over finite paths, we need another component, a probability distribution $p_l \in Dist(\mathbb{N}_0)$ over the length of paths. Given an MDP $M = \langle Q, \Sigma_{in}, \Sigma_{out}, q_0, \delta, L \rangle$, a length probability $p_l$ and a scheduler $s$ induce a probability distribution $P^l_{M,s}$ on the set of paths $Path_M$, defined by:
$$P^l_{M,s}(q_0 i_1 q_1 \cdots i_n q_n) = p_l(n) \cdot \prod_{j=1}^{n} s(q_0 \cdots i_{j-1} q_{j-1})(i_j) \cdot \delta(q_{j-1}, i_j)(q_j)$$
Probability distributions over finite paths may instead of $p_l$, e.g., include state-dependent termination probabilities [34]. We take a path-based view because we actively sample from MDPs. As paths correspond to traces, $P^l_{M,s}$ defines a distribution over traces as well and we control the trace length via $p_l$.
Since we target reachability, we do not need general schedulers, but may restrict ourselves to memoryless deterministic schedulers [30]. A scheduler is memoryless if its choice of inputs depends only on the current state, i.e. it is a function from $Q$ to $Dist(\Sigma_{in})$. It is deterministic if for all $\rho \in Path_M$, there is exactly one $i \in \Sigma_{in}$ such that $s(\rho)(i) = 1$. Otherwise, it is called randomised. Example 1 describes an MDP and a scheduler for a faulty coffee machine.
Note that bounded reachability actually requires finite-memory schedulers. However, bounded reachability can be encoded as unbounded reachability by transforming the MDP model [10], at the expense of an increased state space.
Figure 2a shows an MDP modelling a faulty coffee machine. Edge labels denote input symbols and corresponding transition probabilities, whereas output labels are placed above states. After insertion of a coin and pressing a button, the coffee machine is supposed to [...] all strings must have length 2, such that, e.g., $P^l_{M,s}(\rho) = 0.9$ for $\rho = q_0 \cdot coin \cdot q_1 \cdot but \cdot q_2$.

Model inference
We infer MDPs via an adaptation of Alergia, called IOAlergia [11,35,36]. The technique takes input-output strings as input and constructs an input output frequency prefix tree acceptor (IOFPTA) representing the strings. An IOFPTA is a tree with edges labelled by inputs and nodes labelled by outputs. Additionally, edges are annotated with frequencies denoting how often a corresponding string was present in the sample. An IOFPTA with normalised frequencies represents a tree-shaped MDP whereby tree nodes correspond to MDP states. In a second step, the IOFPTA is transformed through iterated state-merging, which potentially introduces cycles. This step compares nodes in the tree and merges them if they show similar output behaviour such that it is likely that they correspond to the same state in the MDP generating the data. IOAlergia basically views the IOFPTA as an MDP with non-normalised transition probabilities. During the operation of the algorithm, the states of the MDP are partitioned into three sets: red states which have been checked, blue states which are neighbours of red states, and uncoloured states. Initially, the only red state is the root of the IOFPTA. After initialisation, pairs of blue and red states are checked for compatibility and merged if compatible. Otherwise, the blue one is coloured red. This is repeated until all states are coloured. After normalisation of transition probabilities, IOAlergia returns an MDP.
Two states are compatible if they have the same label, their outgoing transitions are compatible and their successors are recursively compatible. Outgoing transitions are compatible if their empirical probabilities, estimated from the data, are sufficiently close to each other. In other words, we check for all inputs whether the estimated probability distributions over outputs conditioned on inputs are sufficiently similar. If they are, we check recursive compatibility of successors reached by all input-output pairs. A parameter $\epsilon_{Alergia} \in (0, 2]$ controls the significance level of a statistical test, which determines whether two empirical probabilities are sufficiently close. We represent calls to IOAlergia by $IOAlergia(S, \epsilon_{Alergia}) = M$ where $S$ is a multiset of input-output strings and $M$ is a resulting MDP.
Figure 1 shows an IOFPTA for the coffee machine from Example 1, but sampled with a (uniformly) randomised scheduler and a different $p_l$. Edge labels denote inputs and associated frequencies, while outputs are placed next to nodes. At first, $s_1$ might be merged with $s_0$ as their successors are similar. Redirecting the but edge from $s_1$ to $s_0$ would create the self loop in the initial state.
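The local part of the compatibility check can be illustrated with the Hoeffding-bound test known from the original Alergia algorithm. The sketch below assumes that IOAlergia applies this test per input and output, and it omits the recursive check of successors; the function names and the dictionary layout of the frequency counts are hypothetical.

```python
import math
from typing import Dict


def hoeffding_different(f1: int, n1: int, f2: int, n2: int, eps: float) -> bool:
    """True if the empirical probabilities f1/n1 and f2/n2 differ significantly."""
    if n1 == 0 or n2 == 0:
        return False  # no observations, hence no evidence against compatibility
    bound = (math.sqrt(1 / n1) + math.sqrt(1 / n2)) * math.sqrt(0.5 * math.log(2 / eps))
    return abs(f1 / n1 - f2 / n2) > bound


def locally_compatible(counts_a: Dict[str, Dict[str, int]],
                       counts_b: Dict[str, Dict[str, int]],
                       eps: float) -> bool:
    """counts_x maps each input to {output: frequency} observed at one tree node."""
    for inp in set(counts_a) | set(counts_b):
        outs_a = counts_a.get(inp, {})
        outs_b = counts_b.get(inp, {})
        n_a, n_b = sum(outs_a.values()), sum(outs_b.values())
        for out in set(outs_a) | set(outs_b):
            if hoeffding_different(outs_a.get(out, 0), n_a, outs_b.get(out, 0), n_b, eps):
                return False
    return True
```

With this form of the test, a larger value of the parameter shrinks the bound and makes merges less likely; at the upper limit 2 the bound is zero, so any difference in observed frequencies prevents a merge.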
The formula $\varphi = F^{<k} \psi$ denotes that $\psi$ should be satisfied in a state reached in less than $k$ steps. We define the satisfaction of $\varphi = F^{<k} \psi$ by a trace $t = o_0 i_1 o_1 \cdots i_n o_n$ via: $t \models F^{<k} \psi \Leftrightarrow \exists j < k : j \le n \wedge o_j \models \psi$. The evaluation of a trace $t$ with respect to a formula $\varphi = F^{<k} \psi$ places restrictions on the length of $t$. In particular, we can only conclude that $t \not\models \varphi$ if $t$ does not contain an $o_i$ with $o_i \models \psi$ and contains at least $k - 1$ steps. In other words, $t$ must be long enough to determine that it does not satisfy a reachability property. To ascertain that all traces can be evaluated, we set the length probability $p_l$ accordingly. We set for all traces: $p_l(j) = 0$ for $j < k - 1$.
The composition of a scheduler $s$ and an MDP $M$ behaves entirely probabilistically. In fact, it induces a discrete time Markov chain (DTMC) [22]. Hence, we can apply techniques from SMC without considering non-determinism. Furthermore, we can define the probability of satisfying a property $\varphi$ with an MDP $M$ and a scheduler $s$ by $P_{M,s}(\varphi) = P^l_{M,s}(\{\rho \in Path_M \mid L(\rho) \models \varphi\})$ for an appropriate $p_l$. Note that the value $P_{M,s}(\varphi)$ does not depend on the actual $p_l$ as long as $p_l$ ensures that traces are long enough to allow reasoning about satisfaction of $\varphi$.
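The evaluation of a single trace against a bounded reachability property can be sketched as a three-valued check. The trace representation (an initial output plus a list of input/output pairs), the encoding of outputs as sets of propositions and the name `check_bounded_reach` are assumptions chosen for this illustration.

```python
from typing import Callable, FrozenSet, List, Optional, Tuple

Output = FrozenSet[str]
Trace = Tuple[Output, List[Tuple[str, Output]]]


def check_bounded_reach(trace: Trace, psi: Callable[[Output], bool], k: int) -> Optional[bool]:
    """Evaluate F^{<k} psi on a trace: True, False, or None if the trace is too short."""
    first, steps = trace
    outputs = [first] + [out for _, out in steps]
    # Only o_0 .. o_{k-1} count, i.e. outputs of states reached in fewer than k steps.
    if any(psi(o) for o in outputs[:k]):
        return True
    # A negative verdict requires at least k - 1 executed inputs.
    if len(steps) >= k - 1:
        return False
    return None


# Usage with hypothetical coffee-machine outputs.
init, beep, coffee = frozenset({"init"}), frozenset({"beep"}), frozenset({"coffee"})
example = (init, [("coin", beep), ("but", coffee)])
serves_coffee = lambda o: "coffee" in o
```

The `None` verdict corresponds to traces that are too short to be evaluated, which the restriction on the length probability rules out during sampling.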
We estimate this probability via SMC [31]. For that, we associate Bernoulli random variables $B_i$ with success probability $p$ with simulations of the SUT. A realisation $b_i$ is 1 if the corresponding sampled trace satisfies $\varphi$ and 0 otherwise. To estimate $p = P_{M,s}(\varphi)$ we apply Monte Carlo simulation. Given $n$ individual simulations, the estimate $\hat{p}$ is the observed relative success frequency, i.e. $\hat{p} = \frac{\sum_{i=1}^{n} b_i}{n}$. In order to bound the error of the estimation with a certain degree of confidence, we compute the number of required simulations based on a Chernoff bound [31,40]. This bound guarantees that if $p$ is the true probability, then the distance between $\hat{p}$ and $p$ is greater than or equal to some $\epsilon$ with a probability of at most $\delta$, i.e. $P(|\hat{p} - p| \ge \epsilon) \le \delta$. The required number of simulations $n$ and the parameters $\epsilon$ and $\delta$ are related by $\delta = 2e^{-2n\epsilon^2}$ [40], i.e. we compute $n$ by
$$n = \frac{\ln(2) - \ln(\delta)}{2\epsilon^2} \quad (2)$$
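Solving the relation between $\delta$, $\epsilon$ and $n$ stated above for $n$ yields the number of simulations. The following sketch computes it and the relative success frequency; the function names are hypothetical, and rounding up to the next integer is an assumption made here so that the bound is not violated.

```python
import math
from typing import Sequence


def required_simulations(eps: float, delta: float) -> int:
    """Smallest integer n with 2 * exp(-2 * n * eps^2) <= delta."""
    if eps <= 0 or not 0 < delta < 1:
        raise ValueError("need eps > 0 and 0 < delta < 1")
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))


def estimate_probability(outcomes: Sequence[bool]) -> float:
    """Relative success frequency of the Bernoulli realisations b_i."""
    if not outcomes:
        raise ValueError("at least one simulation is required")
    return sum(outcomes) / len(outcomes)
```

For instance, `required_simulations(0.01, 0.01)` evaluates to 26492, since $\ln(200) / 0.0002 \approx 26491.6$.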

Probabilistic black-box reachability checking
We begin with an overview of the approach, discussing each of the steps briefly. Subsequently, we provide in-depth descriptions. For the remainder of this paper, we assume that we interact with an MDP $M = \langle Q, \Sigma_{in}, \Sigma_{out}, q_0, \delta, L \rangle$ representing the SUT, of which we only know $\Sigma_{in}$. Basically, we try to find an optimal scheduler for satisfying a given reachability property $\varphi = F^{<k} \psi$ with the SUT $M$. For this purpose, we iteratively sample system traces, learn models from the traces and infer schedulers from the learned models. The inferred schedulers also serve to refine the sampling strategy. This process is also shown in Fig. 1. We may stop before executing maxRounds rounds if a stopping criterion is satisfied. This criterion is realised with a heuristic check for convergence. In this check, we basically determine whether several consecutive schedulers behave similarly.

Evaluate
In a last step, we evaluate the most recent scheduler we have generated. For this evaluation, we sample system traces again, but avoid choosing inputs randomly. The relative frequency of satisfying $\varphi$ now gives us an estimate for the success probability of executing $M$, the black-box SUT, controlled by scheduler $s_l$, where $l$ is the last round we executed. A Chernoff bound [31,40], which is commonly used in SMC, specifies the required number of samples.

Create initial samples
In the first step, we sample system traces by choosing input actions randomly according to a uniform distribution. Hence, we sample with a scheduler $s_{unif}$ defined as follows: $\forall q \in Q, \forall i \in \Sigma_{in} : s_{unif}(q)(i) = \frac{1}{|\Sigma_{in}|}$. Sampling is further controlled by the length probability $p_l$ and by the batch size $n_{batch}$, i.e. the number of traces collected at once. These parameters also affect subsequent sampling. Additionally, we set a seed-value for the initialisation of pseudo-random functions.
As discussed in Sect. 3, we set $p_l(j) = 0$ for $j < k - 1$ if $k$ is the step bound of the property we test for. This would not be necessary for learning, but we generally apply this constraint. The length of suffixes, i.e. the trace extensions beyond $k$, follows a geometric distribution parameterised by $p_{quit} \in [0, 1]$. Before each step, we stop with probability $p_{quit}$. Hence, the number of input steps $|t|$ in a trace $t$ is distributed according to $p_l(|t|) = (1 - p_{quit})^{|t| - k + 1} \, p_{quit}$ for $|t| \ge k - 1$ and $p_l(|t|) = 0$ otherwise. Both $p_{quit}$ and $n_{batch}$ must be supplied by the user. In the following, $S_i$ denotes the multiset of traces created by the $i$th sampling step, and $S_{all}$ refers to the multiset of all traces. Hence, $S_{all}$ is initially set to $S_1$, containing $n_{batch}$ traces distributed according to $P^l_{M,s_{unif}}$, collected by random testing.
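Drawing a trace length according to this distribution can be sketched as below. The function name is hypothetical, and the sketch requires a strictly positive stopping probability, because sampling would otherwise never terminate.

```python
import random


def sample_trace_length(k: int, p_quit: float, rng: random.Random) -> int:
    """Draw |t| with p_l(|t|) = (1 - p_quit)^(|t| - k + 1) * p_quit for |t| >= k - 1."""
    if not 0.0 < p_quit <= 1.0:
        raise ValueError("p_quit must lie in (0, 1] for sampling to terminate")
    length = max(k - 1, 0)
    # Before each further step, stop with probability p_quit.
    while rng.random() >= p_quit:
        length += 1
    return length
```

The expected number of steps beyond the mandatory $k - 1$ is $(1 - p_{quit}) / p_{quit}$, so smaller values of $p_{quit}$ produce longer traces on average.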

Given the traces in $S_{all}$, we use IOAlergia to infer an MDP $M_{h_i}$, i.e. an approximate system model. Strictly speaking, we infer an MDP with a partial transition function, which we make input-complete with a function complete. The transition function of an inferred MDP may be undefined for some state-input pair if there is no corresponding execution in $S_{all}$. For this reason, we add transitions to a special state labelled with dontKnow for undefined state-input pairs. Once we enter that state, we cannot leave it. The label dontKnow is more generally a special output label, which is not part of the original output alphabet.
Following the terminology of active automata learning [4], we refer to $M_{h_i}$ as the current hypothesis. Input completion via complete is required by Definition 1, but does not affect the reachability analysis. We aim at maximising the probability of desired events, therefore generated schedulers will not choose to execute inputs leading to the state $q_{undef}$ labelled dontKnow. This is due to the fact that once we have reached $q_{undef}$, we have a probability of zero to observe anything other than dontKnow according to our hypothesis.
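Input completion as described in the text can be sketched as follows. The dictionary-based representation of the partial transition function and the function signature are assumptions for illustration; only the sink state labelled dontKnow and its self-loops are taken from the description above.

```python
from typing import Dict, Iterable, Set, Tuple

Delta = Dict[Tuple[str, str], Dict[str, float]]


def complete(states: Iterable[str], inputs: Iterable[str], delta: Delta,
             labels: Dict[str, str], sink: str = "q_undef"
             ) -> Tuple[Set[str], Delta, Dict[str, str]]:
    """Return an input-enabled copy of a hypothesis with a partial transition function."""
    new_states = set(states) | {sink}
    new_labels = dict(labels)
    new_labels[sink] = "dontKnow"  # special output outside the original alphabet
    new_delta = {key: dict(dist) for key, dist in delta.items()}
    input_list = list(inputs)
    for q in new_states:
        for i in input_list:
            # Undefined pairs lead to the sink with probability 1; since the sink
            # itself has no defined transitions, it only receives self-loops.
            new_delta.setdefault((q, i), {sink: 1.0})
    return new_states, new_delta, new_labels
```

Existing transitions are copied unchanged, so the completion cannot alter probabilities that were estimated from the sampled traces.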
Reachability analysis. Given the current hypothesis inferred in the last step, our implementation of the approach uses the Prism model checker [29] to derive a scheduler for satisfying the property $\varphi$. This is achieved by performing the following steps in a fully automated manner:
1. Translate $M_{h_i}$ into the Prism modelling language, whereby we encode
1.1 states using integers,
1.2 inputs using commands labelled with actions, and
1.3 outputs using labels.
2. Since Prism only supports scheduler generation for unbounded reachability properties, we preprocess the translated $M_{h_i}$ further [10] and encode $\varphi$ as an unbounded property:
2.1 We add a step-counter variable steps ranging between 0 and $k$, where $k$ is the step bound of the examined property.
2.2 The variable steps is incremented with every execution of an input until the maximal value $k$ is reached. Once steps = $k$, steps is left unchanged.
2.3 We change $\varphi$ to $\varphi' = F(\psi \wedge steps < k)$, i.e. we move the bound from the temporal operator to the property that should be reached.
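To illustrate what the reachability analysis computes, the following sketch replaces Prism by plain backward induction over the number of remaining inputs. It is a stand-in for illustration only and not the Prism-based implementation; the function name, the dictionary-based MDP representation and the toy coffee-machine numbers are assumptions. The resulting scheduler depends on the state and the remaining step budget, which mirrors the role of the step counter in the encoding above.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Delta = Dict[Tuple[str, str], Dict[str, float]]


def max_bounded_reach(states: Iterable[str], inputs: List[str], delta: Delta,
                      labels: Dict[str, str], psi: Callable[[str], bool],
                      q0: str, k: int) -> Tuple[float, Dict[Tuple[str, int], str]]:
    """Maximal probability of F^{<k} psi from q0 and a step-dependent scheduler.

    scheduler[(q, m)] is the input chosen in state q when m inputs remain.
    """
    if k < 1:
        raise ValueError("the step bound k must be at least 1")
    states = list(states)
    # With no inputs left, psi must already hold in the current state.
    value = {q: 1.0 if psi(labels[q]) else 0.0 for q in states}
    scheduler: Dict[Tuple[str, int], str] = {}
    for remaining in range(1, k):  # at most k - 1 inputs may be executed
        new_value = {}
        for q in states:
            if psi(labels[q]):
                new_value[q] = 1.0
                continue
            best_input, best = None, -1.0
            for i in sorted(inputs):
                v = sum(p * value[q2] for q2, p in delta[(q, i)].items())
                if v > best:
                    best_input, best = i, v
            new_value[q] = best
            scheduler[(q, remaining)] = best_input
        value = new_value
    return value[q0], scheduler


# Hypothetical input-enabled coffee machine: pressing the button fails with probability 0.1.
toy_labels = {"q0": "init", "q1": "beep", "q2": "coffee"}
toy_delta: Delta = {
    ("q0", "coin"): {"q1": 1.0}, ("q0", "but"): {"q0": 1.0},
    ("q1", "coin"): {"q1": 1.0}, ("q1", "but"): {"q2": 0.9, "q0": 0.1},
    ("q2", "coin"): {"q0": 1.0}, ("q2", "but"): {"q0": 1.0},
}
prob, sched = max_bounded_reach(toy_labels, ["coin", "but"], toy_delta, toy_labels,
                                lambda o: o == "coffee", "q0", 3)
```

For the toy model and $k = 3$, the computed scheduler inserts a coin and then presses the button; with a larger bound, a second attempt after a failure becomes possible and the maximal probability increases accordingly.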

Algorithm 1 Property-Directed Sampling
Sample with scheduler. Property-directed sampling with inferred schedulers aims at exploring parts of the system more thoroughly that have been identified as being relevant to the property. To avoid getting trapped in local optima, we also explore new paths by choosing actions randomly with probability $p_{rand_i}$, where $i$ corresponds to the current round. This probability is decreased in each round to explore more broadly in the beginning and focus on relevant parts in later rounds. Two parameters control $p_{rand_i}$: $p_{start} \in [0, 1]$ for the initial probability and $c_{change} \in [0, 1]$ specifying an exponential decrease, i.e. $p_{rand_1} = p_{start}$ and $p_{rand_{i+1}} = c_{change} \cdot p_{rand_i}$ for $i \ge 1$.
Basically, we execute both SUT and $M_{h_i}$ in parallel. The former ensures that we sample traces of the actual system, while the latter is necessary because the inferred scheduler $s_{h_i}$ is defined for $M_{h_i}$. Stated differently, we need to simulate the path taken by the SUT on the current hypothesis $M_{h_i}$. This enables selecting actions with $s_{h_i}$. As $M_{h_i}$ is an approximation, two scenarios may occur in which we cannot use $s_{h_i}$ and default to choosing inputs randomly:
1. The SUT may show outputs not foreseen by $M_{h_i}$, i.e. not only probabilities differ. In such cases, we cannot determine the correct state transition in $M_{h_i}$.
2. By performing random inputs we may follow a path that is not optimal with respect to $M_{h_i}$ and $\varphi$. Thus, we may enter a state where $s_{h_i}$ is undefined.
The sampling is detailed in Algorithm 1. In addition to artefacts generated by other steps, such as the input-enabled hypothesis $M_{h_i}$ and the generated scheduler $s_{h_i}$, sampling requires two auxiliary operations:
- reset: resets the SUT to the initial state and returns the unique initial output symbol
- exec: executes a single input, changing the state of the SUT and returning the corresponding output
Both operations are realised by a test adapter. Lines 1 to 5 of the algorithm collect traces by calling the Sample function and update $p_{rand_i}$. The Sample function returns a single trace, which is created on-the-fly and initialised with the output produced upon resetting the SUT (line 7). Line 8 initialises the current model state. Afterwards, an input is chosen (lines 10 to 13). It is chosen randomly with probability $p_{rand_i}$ or if we cannot determine an input (lines 10 and 11). Otherwise, the input is selected with $s_{h_i}$ (line 13). We record the output of the SUT in response to the input and extend the trace in lines 14 and 15. The next two lines update the model state. In case the SUT produces an output which is not allowed by the model, the new model state is undefined (second case in line 17). This corresponds to the first scenario, in which we default to choosing inputs randomly. Trace creation stops if the trace is long enough and if a probabilistic Boolean choice returns true (line 9), i.e. the actual length follows a probability distribution. Note that lines 10 to 17 implement a randomised scheduler derived from $s_{h_i}$. We will refer to this scheduler as randomised($s_{h_i}$). In the taxonomy of Utting et al. [51], property-directed sampling can be categorised as model-checking-based online testing with a combination of requirements coverage and random input-selection as criterion.
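The description of the sampling procedure can be sketched in code as follows. This is an illustration under several assumptions and not a transcription of Algorithm 1: the scheduler is simplified to a memoryless mapping from hypothesis states to inputs, the hypothesis is given as dictionaries, and the class `CoffeeSut` is a hypothetical stand-in for a black-box SUT behind a test adapter.

```python
import random
from typing import Dict, List, Optional, Tuple

Delta = Dict[Tuple[str, str], Dict[str, float]]


class CoffeeSut:
    """Hypothetical black-box SUT exposing the two test-adapter operations."""

    def __init__(self, seed: int) -> None:
        self._rng = random.Random(seed)
        self._state = "init"

    def reset(self) -> str:
        self._state = "init"
        return self._state

    def exec(self, inp: str) -> str:
        if self._state == "init" and inp == "coin":
            self._state = "beep"
        elif self._state == "beep" and inp == "but":
            # Stochastic failure: coffee is served only 90% of the time.
            self._state = "coffee" if self._rng.random() < 0.9 else "init"
        elif self._state == "coffee":
            self._state = "init"
        return self._state


def sample(sut, inputs: List[str], delta: Delta, labels: Dict[str, str], q0: str,
           scheduler: Dict[str, str], p_rand: float, k: int, p_quit: float,
           rng: random.Random) -> List[str]:
    """Create one trace o0 i1 o1 ... while following the scheduler where possible."""
    if not 0.0 < p_quit <= 1.0:
        raise ValueError("p_quit must lie in (0, 1] so that sampling terminates")
    trace = [sut.reset()]
    q: Optional[str] = q0  # current hypothesis state; None once it is undefined
    # Stop only if the trace has at least k - 1 inputs and the coin flip succeeds.
    while not (len(trace) // 2 >= k - 1 and rng.random() < p_quit):
        if rng.random() < p_rand or q is None or q not in scheduler:
            inp = rng.choice(inputs)
        else:
            inp = scheduler[q]
        out = sut.exec(inp)
        trace += [inp, out]
        if q is not None:
            # Deterministic hypothesis: the output identifies at most one successor.
            succ = [q2 for q2, p in delta.get((q, inp), {}).items()
                    if p > 0 and labels[q2] == out]
            q = succ[0] if succ else None
    return trace


def property_directed_sampling(sut, inputs, delta, labels, q0, scheduler,
                               p_rand, c_change, n_batch, k, p_quit, rng):
    """Collect n_batch traces, then decrease the randomisation probability."""
    traces = [sample(sut, inputs, delta, labels, q0, scheduler, p_rand, k, p_quit, rng)
              for _ in range(n_batch)]
    return traces, p_rand * c_change
```

Calling `property_directed_sampling` with `p_rand` set to 0 corresponds to sampling purely with the scheduler, falling back to random inputs only where the hypothesis state or the scheduler's choice is undefined.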
Evaluate. As a result of the reachability analysis, Prism calculates a probability of reaching $\varphi$. This probability, however, is derived from a learned model which is possibly inaccurate. Therefore, it may greatly differ from the actual probability of reachability with the SUT. To account for that, we evaluate the scheduler $s_h = s_{h_l}$, where $l$ is the last round we executed. We accomplish this by sampling a multiset of traces $S_{eval}$, while generally selecting inputs with $s_h$, i.e. we execute Algorithm 1 with $p_{rand_i} = 0$. Thereby, we implicitly sample traces from the DTMC induced by the composition of $M$ and randomised($s_h$). Since this DTMC behaves entirely probabilistically, we can apply SMC. Hence, we estimate $P_{M,s_h}(\varphi)$ by the relative frequency $\hat{p}_{M,s_h} = \frac{|\{t \in S_{eval} \mid t \models \varphi\}|}{|S_{eval}|}$. To achieve a given error bound $\epsilon_{eval}$ with a given confidence $1 - \delta_{eval}$, we compute the required number of samples $|S_{eval}| = n_{batch}$ based on a Chernoff bound [40], i.e. we apply (2). The estimation provides an approximate lower bound of the maximal reachability probability with the SUT. We consider $\hat{p}_{M,s_h}$ an approximate lower bound, because we know with confidence $1 - \delta_{eval}$ that $\max_s P_{M,s}(\varphi)$ is at least as large as $\hat{p}_{M,s_h} - \epsilon_{eval}$.

Check early stop
We have observed that the performance of schedulers usually increases with the total amount of available data. Probability estimations derived with intermediate schedulers showed that schedulers generated in later rounds tend to perform better than those generated in earlier rounds. However, we have also seen fluctuations in these estimations over time, i.e. some schedulers may perform worse than schedulers generated in previous rounds. With an increasing number of rounds, these fluctuations generally diminish and the estimations converge. Intuitively, this can be explained by the influence of $p_{rand_i}$ in Algorithm 1, which controls the probability of selecting random inputs and decreases over time. As this probability approaches zero, we will almost always select inputs with generated schedulers. This will generally only increase the confidence in parts of the system we have already explored, but will not explore new parts; therefore, new schedulers are likely to show similar behaviour to previous ones.
Based on these observations, we developed a heuristic check for convergence. If it detects convergence, we stop the iteration early before executing maxRounds rounds. Two simpler checks actually form the basis of the heuristic. The first, called similarSched, basically compares the scheduler generated in the current round to the scheduler from the previous round and returns true if both behave similarly. The second check, called conv builds upon the first and reports convergence if we detect statistically similar behaviour via similarSched in multiple consecutive rounds. The rationale behind this is that schedulers should behave alike after convergence. We check for similarity rather than for equivalence because there may be several optimal inputs in a state and slight variations in transition probabilities in the inferred models may lead to the different choices of inputs. Furthermore, we can compare schedulers during sampling by comparing whether they would choose the same inputs. This gives us a large number of events as basis for our decision and does not require additional sampling.
The convergence check has three parameters: $\alpha_{conv}$ controlling the confidence level, an error bound $\epsilon_{conv}$, and a bound on the number of rounds $r_{conv}$. The first two parameters control a statistical test which checks whether two schedulers behave similarly. For this test, we consider Bernoulli random variables $E_i$ for $i \in [2 .. maxRounds]$. $E_i$ is equal to one if two consecutive schedulers $s_{h_i}$ and $s_{h_{i-1}}$ behave equivalently, i.e. choose the same input in some state, and zero otherwise. Let $p_{E_i}$ be the success probability, i.e. the probability of $E_i$ being equal to one. We observe samples of $E_i$ in Line 13 of Algorithm 1. Each time we choose an input $i$ with $s_{h_i}$, we also determine which input $i'$ the previous scheduler $s_{h_{i-1}}$ would have chosen. We record a positive outcome if $i = i'$ and a negative outcome otherwise.
Let $\hat{p}_{E_i}$ be the relative number of positive outcomes, which is an estimate of $p_{E_i}$. If $p_{E_i}$ is equal to one, then both schedulers behave equivalently, i.e. they always choose the same input. Consequently, we test whether $\hat{p}_{E_i}$ is close to one. We test the null hypothesis $H_0: p_{E_i} \le 1 - \epsilon_{conv}$ against $H_1: p_{E_i} > 1 - \epsilon_{conv}$ with a confidence level of $1 - \alpha_{conv}$. The hypothesis $H_1$ denotes that the compared schedulers choose the same inputs in most of the cases. Let similarSched($\alpha_{conv}$, $\epsilon_{conv}$, $i$) be the result of this test in round $i$, which is true if $H_0$ is rejected and false otherwise.
Finally, we can formulate the complete convergence check conv($\alpha_{conv}$, $\epsilon_{conv}$, $i$). It returns true in round $i$ if $r_{conv}$ consecutive calls of similarSched returned true, i.e.

$conv(\alpha_{conv}, \epsilon_{conv}, i) = \bigwedge_{j=i-r_{conv}+1}^{i} similarSched(\alpha_{conv}, \epsilon_{conv}, j)$.

Note that $p_{rand_i}$ implicitly affects the convergence check. We collect samples of $E_i$ in Line 13 of Algorithm 1, thus a large $p_{rand_i}$ causes Line 13 to be executed infrequently. As a result, sample sizes of $E_i$ are small. This influence on the convergence check is beneficial because schedulers are more likely to improve if $p_{rand_i}$ is large, as new parts of the system may be explored via frequent random steps.
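The two checks can be sketched in Python as follows. The text does not state which statistical test is used, so this sketch assumes an exact one-sided binomial test; the function names and the representation of per-round results as a list are likewise assumptions made for illustration.

```python
from math import exp, lgamma, log

def similar_sched(agree, total, alpha, eps):
    """Return True if H0: p <= 1 - eps is rejected at significance level alpha.

    agree: number of positive outcomes, total: number of observed choices.
    Assumes 0 < eps < 1.
    """
    if total == 0:
        return False               # no observations, no evidence of similarity
    p0 = 1.0 - eps
    p_value = 0.0
    # p-value: probability of at least `agree` agreements if p were exactly p0
    for k in range(agree, total + 1):
        log_pmf = (lgamma(total + 1) - lgamma(k + 1) - lgamma(total - k + 1)
                   + k * log(p0) + (total - k) * log(1.0 - p0))
        p_value += exp(log_pmf)
    return p_value < alpha

def conv(history, r_conv):
    """history[j] is the similarSched result of round j; check the last r_conv."""
    return len(history) >= r_conv and all(history[-r_conv:])
```

With $\alpha_{conv} = \epsilon_{conv} = 0.01$, perfect agreement on 100 observed choices is not enough to reject $H_0$ in this sketch, whereas perfect agreement on 1000 choices is. This matches the remark that small sample sizes, caused by a large $p_{rand_i}$, delay the detection of convergence.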
While the check introduces further parameters, it may simplify the application of the approach in scenarios where we have little knowledge about the system at hand. In such cases, it may be difficult to find a reasonable choice for the number of rounds maxRounds. With this heuristic, it is possible to choose maxRounds conservatively, but stop early once convergence is detected. However, it may also impair results, if convergence is detected too early.
Convergence to the true model. Generally, Mao et al. [36] showed convergence in the large sample limit for IOAlergia. However, the sampling mechanism needs to ensure that sufficiently many executions of all inputs in all states are observed. This is also discussed in [35], which states that IOAlergia requires a fair scheduler, i.e. one that chooses each input infinitely often. The uniformly randomised scheduler $s_{unif}$ satisfies this requirement. As a result, we have convergence in the limit if we perform only a single round of inference in which we sample with $s_{unif}$.
Property-directed sampling favours certain inputs with increasing number of rounds, but it also selects random inputs with probability p randi in round i. If we ensure that p randi is always non-zero, we will select all inputs infinitely often in an infinite number of rounds. Therefore, the inferred models will converge to the true model (up to bisimulation equivalence) and the inferred schedulers will converge to the optimal scheduler.
Another way to approach convergence is to follow a hybrid approach, by collecting traces via property-directed sampling and via uniform sampling in parallel. Uniform sampling ensures that all inputs are executed sufficiently often, which entails convergence. Property-directed sampling explores parts of the system identified to be relevant, which increases the confidence in the correct inference of those parts. As a result, intermediate schedulers are more likely to perform well.
Hence, we have convergence in the limit under certain assumptions. In practice, i.e. when learning from limited data, uniform schedulers are likely to be insufficient if events occur only after long interaction scenarios. If events occur rarely in the sampled system traces, then it is unlikely that the part modelling those events is accurately learned. Active learning, as described by Chen and Nielsen [13], addresses this issue by guiding sampling so as to reduce the uncertainty in the learned model. Our approach similarly guides sampling, but with the aim of reducing uncertainty along traces that are likely to satisfy a reachability property.
As noted above, we have seen that the inferred schedulers usually converge to a scheduler, which may not be globally optimal, though. We also performed experiments with the outlined hybrid approach to avoid getting trapped in local maxima, by collecting half the system traces through uniform sampling. While it showed favourable performance in a few cases, the incremental approach generally produced better results with the same number of samples. Therefore, we will not discuss experiments with the hybrid approach.
Apart from convergence, it may not always be necessary to find a (near-)optimal scheduler. A requirement may state that the probability of reaching an erroneous state must be smaller than some $p$. Once we have found and evaluated a scheduler $s_h$ such that the estimation $\hat{p}_{M,s_h} \ge p$, we have basically shown with some confidence that the requirement is violated. Such a requirement could be the basis of another stopping criterion. If in round $i$ a sufficiently large number of the sampled traces $S_i$ reaches an erroneous state, we may decide to evaluate the corresponding scheduler $s_{h_{i-1}}$. We could then stop if $\hat{p}_{M,s_{h_{i-1}}} \ge p$ and continue otherwise.

Application and choice of parameters
We will now briefly discuss the choice of parameters taking our findings into account. A summary of all parameters along with a concise description is given by Table 1.
The product $n_s = n_{batch} \cdot maxRounds$ defines the overall maximum number of samples for inference, thus it could be chosen as large as the testing/simulation budget permits. Increasing maxRounds while fixing $n_s$ increases the time required for learning and model checking. Intuitively, it improves accuracy as well, as sampling is more frequently adjusted towards the considered property. For the systems examined in Sect. 5, values in the range between 50 and 200 led to reasonable accuracy while incurring acceptable runtime overhead. Runtime overhead is the time spent learning and model checking, as opposed to the time spent doing actual testing, i.e. (property-directed) sampling. The convergence check takes three parameters as input for which we identified well-suited default parameters. To ensure high confidence for the statistical test, we set $\alpha_{conv} = 0.01$. Since schedulers should choose the same input in most cases, $\epsilon_{conv}$ should be small, but greater than zero to allow for some variation. In our experiments, we set it to $\epsilon_{conv} = 0.01$ and we set $r_{conv} = 6$. More conservative choices would be possible at the expense of performing additional rounds.
The value of p start should generally be larger than 0.5, while c change should be close to 1. This ensures broad exploration in the beginning and more directed exploration afterwards. Finally, the choice of p quit depends on the simulation budget and the number of inputs. If there is a large number of inputs, it may be highly improbable to reach certain states within a

Experiments
We evaluated our approach on five case studies from the areas of automata learning, control policy synthesis, and probabilistic model-checking. For the first case study, we created our own model of the slot machine described by Mao et al. [36] in the context of learning MDPs. Two case studies consider models of network protocols enhanced with stochastic failures. For that, we transformed deterministic Mealy-machine models as detailed below. The model used in the fourth case study is inspired by the gridworld example, for which Fu and Topcu synthesised control strategies [23]. Finally, we generate schedulers for a consensus protocol [6] which serves as a benchmark in probabilistic model-checking. We discussed the experiments involving the slot machine and the network protocol models before [3]. New additions in this extended version are experiments with the gridworld example, the consensus protocol, and experiments with the convergence check. Note that due to changes of the implementation, measurement results may differ from those in [3].

Adding stochastic failures
Deterministic Mealy machines serve as the basis for two case studies. These Mealy machines are results from previous learning experiments [20,49] and model communication protocols. Basically, we simulate stochastic failures by adding outputs represented by the label crash. These occur with a predefined probability instead of the correct output. Upon such a failure, the system is reset. We implemented this by transforming the Mealy machines as follows:
1. Translate the Mealy machine into a Moore machine: this effectively creates an MDP $M = \langle Q, \Sigma_{in}, \Sigma_{out}, q_0, \delta, L \rangle$ with a non-probabilistic $\delta$.
2. Extend $\Sigma_{out}$ with a new symbol crash and add $q_{cr}$ to $Q$ with $L(q_{cr}) = crash$.
3. For a predefined probability $p_{cr}$ and for all $o$ in a predefined set Crashes:
3.1. Find all $q, q' \in Q$, $i \in \Sigma_{in}$ such that $\delta(q, i)(q') = 1$ and $L(q') = o$
3.2. Set $\delta(q, i)(q') = 1 - p_{cr}$ and $\delta(q, i)(q_{cr}) = p_{cr}$
3.3. For all $i \in \Sigma_{in}$ set $\delta(q_{cr}, i)(q_{cr}) = p_{cr}$ and $\delta(q_{cr}, i)(q_0) = 1 - p_{cr}$
This simulates stochastic failures of outputs belonging to a set Crashes. Instead of producing the correct outputs, we output crash and reach state $q_{cr}$ with a certain probability. With the same probability we stay in this state and otherwise we reset the system to $q_0$ after the crash.
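Steps 2 and 3 of this transformation can be sketched in Python, starting from a Moore machine that has already been obtained in Step 1. The dictionary-based representation (delta maps a state-input pair to a distribution over successor states, labels maps states to outputs) and the function name add_crashes are assumptions made for this illustration.

```python
def add_crashes(delta, labels, inputs, q0, crashes, p_cr, q_cr="q_cr"):
    """Seed stochastic crashes into a deterministic Moore machine.

    delta:  dict (state, input) -> dict successor -> probability
    labels: dict state -> output label
    """
    new_delta = {key: dict(dist) for key, dist in delta.items()}
    new_labels = dict(labels)
    new_labels[q_cr] = "crash"                       # Step 2
    for (q, i), dist in delta.items():               # Steps 3.1 and 3.2
        for target, prob in dist.items():
            if prob == 1 and labels[target] in crashes:
                new_delta[(q, i)] = {target: 1 - p_cr, q_cr: p_cr}
    for i in inputs:                                 # Step 3.3
        new_delta[(q_cr, i)] = {q_cr: p_cr, q0: 1 - p_cr}
    return new_delta, new_labels
```

For example, a two-state machine in which input a leads from q0 to a state labelled ack yields, for Crashes = {ack} and $p_{cr} = 0.1$, a transition that reaches the ack state with probability 0.9 and $q_{cr}$ with probability 0.1.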

Measurement setup and criteria
We have complete information about all models. This allows us to compare our results to optimal values. Nevertheless, for the evaluation we treat the systems as black boxes. The state spaces of the models without step-counter variables for bounded reachability are of sizes 471 (slot machine), 63 (MQTT), 157 (TCP), 35 (gridworld), and 272 (consensus protocol), respectively. For each of these systems, we identified an output relevant to the application domain and applied the presented technique to reach states emitting this output with varying numbers of steps. The slot machine grants prizes and we generated strategies to observe the rarest prize. Using the steps discussed above, we seeded stochastic failures into the MQTT and TCP models, which we tried to reach. The gridworld we used in the evaluation contains a dedicated goal location that served as the reachability objective. In the case of the consensus protocol, we generated strategies to finish the protocol, i.e. reach consensus, with high probability.
For a black-box MDP M and a reachability property φ, we compare four approaches to find schedulers s for P M,s (φ):

Incremental Scheduler Inference
We apply the incremental approach discussed in Sect. 4 with a fixed number of rounds. Inferred schedulers are denoted by s inc .

Incremental with Convergence Check
We apply the incremental approach, but stop if we either detect convergence with conv or if maxRounds rounds have been executed. Inferred schedulers are denoted by s conv .

Monolithic Scheduler Inference
To check if the incremental refinement of inferred models pays off, we use the same approach but set maxRounds = 1. In other words, we sample traces by solely choosing inputs randomly. Based on this, we perform a single round, inferring a model and a scheduler which we evaluate. To balance the simulation budget, we collect maxRounds·n batch traces, where maxRounds and n batch are the parameter settings for inferring s inc . We denote monolithically inferred schedulers by s mono .

Uniform Schedulers
As a baseline for comparison we compare to the randomised scheduler s unif which chooses inputs according to a uniform distribution. This resembles random testing without additional knowledge.
Furthermore, let s opt = argmax s P M,s (φ) be the optimal scheduler. This is the optimal scheduler obtained for the generating model M. As the most important measure of quality, we compare estimates of P M,s (φ) to the maximal probability P M,s opt (φ). We consider  (2). We balance the number of test steps for the incremental and the monolithic approach by executing the same number of tests. As a result, the simulation costs for executing tests is approximately the same. Since the incremental approach requires model learning and model checking in each round, it will also require more computation time than the monolithic approach. While our main focus is on evaluating with respect to the achieved probability estimation, we will briefly discuss computation cost at the end of the section.
We also briefly discuss estimations based on model checking of inferred models M h , i.e. max s P M h ,s (φ) calculated by Prism [29]. These estimations have also been discussed by Mao et al. [36]. They noted that estimations may differ significantly from optimal values in some cases, but generally represent good approximations.

Implementation and settings
We base the evaluation on our Java implementation of the described technique, which can be found at [43]. All experiments were performed with a Lenovo Thinkpad T450 with 16 GB RAM and an Intel Core i7-5600U CPU operating at 2.6 GHz and running Xubuntu Linux 18.04. The systems were modelled with Prism [29]. Prism served three purposes:
- We exported the state, transition, and label information from models. We simulated the models in a black-box fashion with this information.
- The maximal probabilities were computed via Prism.
- Prism's scheduler generation was used for scheduler inference.
Simulation, as well as sampling, is controlled by probabilistic choices. To ensure reproducibility, we used fixed seeds for pseudo-random number generators controlling the choices. All experiments were run with 20 different seeds and we discuss statistics derived from 20 such runs. For the evaluation of schedulers, we applied a Chernoff bound with $\epsilon_{eval} = 0.01$ and $\delta_{eval} = 0.01$. In contrast to the conference version of this paper [3], we used a fixed significance level for the compatibility check of IOAlergia, by setting $\epsilon_{Alergia} = 0.5$, a value also used by Mao et al. [36]. They noted that IOAlergia is generally robust with respect to the choice of this value, but we found that our approach benefits from a larger value. Table 2 summarises parameter settings that apply in general.

Slot machine
The slot machine was analysed in the context of MDP inference before [36]. Basically, it has three reels which are initially blank and which either show apple or bar after spinning (one input per reel). With an increasing number of spins, the probability of bar decreases. A player is given a number of spins $m$, after which one of three prizes is awarded depending on the reel configuration. A fourth input leads with equal probability either to two extra spins (with a maximum of $m$), or to stopping the game prematurely including issuance of prizes. For the evaluation, we reimplemented the model, therefore probabilities and state space differ from [36]. As property, we investigated reaching the output Pr10 if $m = 5$, representing a prize that is awarded after stopping the game, if all reels show bar. The parameter settings for the learning experiments are given by Table 3, that is, $p_{quit} = 0.05$, maxRounds = 100, and $n_{batch} = 1000$. Figure 3 shows evaluation results comparing the different approaches. Box plots summarising the probability estimations for reaching Pr10 in less than 8 steps are shown in Fig. 3a, and Fig. 3b shows results for a limit of 14 steps. From left to right, the blue boxes correspond to $s_{mono}$, the black boxes correspond to $s_{inc}$, and the red boxes correspond to $s_{conv}$, i.e. the incremental approach with convergence check. Dashed lines mark optimal probabilities. Note that estimations may be slightly larger than the optimal value in rare cases because they are based on simulations. This can be observed for $s_{inc}$ in Fig. 3a and also in some of the following experiments. The applied Chernoff bound gives a confidence value for staying within the error bound $\epsilon_{eval}$, in case we actually found an optimal scheduler.
Estimations with the baseline s unif are fairly constant, at approximately 0.012 for 8 steps and at 0.019 for 14 steps. As estimations with s mono , s inc , and s conv are significantly higher,  this shows that our approach positively influences the probability of reaching a desired event.
We further see that the incremental approach performs better than the monolithic, whereby the gap increases with step size. Unlike the monolithic approach, the incremental approach finds near-optimal schedulers in both cases. However, the relative number of near-optimal schedulers decreases with increasing step bound.
Early stopping via detecting convergence affects performance only slightly. The differences between the quartiles derived for $s_{conv}$ and for $s_{inc}$ are actually lower than the error bound $\epsilon_{eval} = 0.01$. Random variations could therefore be the cause of a visually perceived performance change. For the first experiment with a limit of 8 steps, early stopping reduced the number of executed rounds to 72.2 on average and to 71.05 for the second experiment. However, one run of the first experiment failed to stop early, because convergence was not detected in less than 100 rounds. Runs of the second experiment executed at most 92 rounds.
Alternatively to simulation-based estimation, estimations may be based on model checking an inferred model [36]. For that, a model $M_h$ is inferred, either incrementally or in a single step, and then a probabilistic model-checker computes $\max_s P_{M_h,s}(\varphi)$. In other words, SMC of the actual SUT controlled by an inferred scheduler is replaced by probabilistic model-checking of a learned model. In the first scenario, estimations are generally bounded above by the optimal probability, while estimations in the second scenario may also overestimate the true optimal probability. An advantage of the second scenario is that it reduces the simulation cost, since SMC requires additional sampling of the SUT. Figure 4a and c show model-checking-based estimations of reaching Pr10 in less than 8 and 14 steps, respectively. Here, $s_{mono}$ denotes that the models $M_h$ were inferred in one step, while $s_{inc}$ denotes incremental model-inference. Incremental model-inference with early stopping is labelled $s_{conv}$. The figures demonstrate that these estimations differ from estimations obtained via SMC (see Fig. 3). The monolithic approach significantly overestimates in both cases. The incremental approach leads to more accurate results. None of the measurement results exceeds the optimal value by more than $\epsilon_{eval}$. Note that early stopping did not significantly affect these estimations. Still, the SMC-based estimations are more reliable in the sense that they establish an approximate lower bound for the true optimal probability.

MQTT with stochastic failures
This case study is based on a Mealy-machine model of an MQTT [8] broker, learned in previous work [49]. We transformed a model of the EMQ [18] broker interacting with two clients, adding stochastic failures to connection acknowledgements and subscription acknowledgements for the second client, whereby we set $p_{cr} = 0.1$. For the evaluation, we infer schedulers that reach crash in less than $k$ steps for several values of $k$. For the sampling, we set $p_{quit} = 0.025$. While this leads to samples longer than necessary for evaluation, e.g., for $k = 5$ the expected length of traces is 43, this increases the chance of seeing crash in a sample, which is reflected in inferred models. The simulation budget is limited by maxRounds = 60 and $n_{batch} = 100$ for the incremental approach without early stopping.
Since the experiments required more than 60 rounds for convergence to be detected, we set maxRounds to 240 for the incremental approach with convergence check. The parameters are also summarised in Table 4. Figure 5 shows box plots for the learning-based approaches. At each $k$, the box plots from left to right summarise measurements for $s_{mono}$ (blue), $s_{inc}$ (black), and $s_{conv}$ (red). The dashed line is the optimal probability achieved with $s_{opt}$, and the solid line represents the average probability of reaching crash with a uniformly randomised scheduler. The box plots demonstrate that larger probabilities are achievable with learning-based approaches than with random testing. All runs including outliers reach crash with a higher probability than random testing. The monolithic approach, however, only performs marginally better in some cases. Both incremental approaches achieve near-optimal results more reliably. All learning-based approaches find at least one near-optimal scheduler out of 20, but incremental inference finds near-optimal schedulers more reliably.
The convergence check causes a reliability gain for k = 8 and k = 17 in this case study, as it basically detected that executing 60 rounds is not enough. It generally required more than 60 rounds to detect convergence, except in a few cases. Experiments for larger values of k required slightly more rounds to be executed, such that on average 79.6 rounds were executed for k = 17. In contrast to this, we executed on average only 72.15 for k = 5. We also see that most estimations of s inc and s conv are in a small range near to the optimal values. However, a few outliers are significantly lower, e.g. at 0.46 for k = 8. Therefore, it makes sense to infer multiple schedulers and discard those performing poorly.
Model-checking-based estimations of reaching crash with the incremental approach led to overestimations in some cases. For instance, the maximal estimation for $k = 11$ is 0.724 while 0.651 is the true optimal value, and also for $k = 5$ one run leads to a model-checking-based estimation of 0.373 although 0.344 is the true optimal value. This is in contrast to the slot machine experiments (see Fig. 4), where the incremental approach produced results close to or lower than the optimal value.

TCP with stochastic failures
This case study builds upon a Mealy-machine model of Ubuntu's TCP server learned by Fiterău-Broștean et al. [20,50]. In previous work [2], we have shown that conformance testing of this system is challenging. Here, we consider a version with random crashes with $p_{cr} = 0.05$, as discussed in the beginning of this section. We mutated transitions to states outputting an acknowledgement which increments both sequence and acknowledgement numbers. For the evaluation, we infer schedulers for $P_{M,s}(F^{<k}\, crash)$ with $k \in \{5, 8, 11, 14, 17\}$ and we set maxRounds = 120 and $n_{batch} = 250$ for the incremental approach without early stopping. Consequently, we set $n_{batch} = 250 \cdot 120$ for the monolithic approach. Since the convergence check detected convergence only after 120 rounds in several experiments, we set maxRounds to 240 for the incremental approach with early stopping. We set $p_{quit}$ to the same value as for MQTT. The parameter settings are also shown in Table 5. Figure 6 shows box plots summarising the collected probability estimations. As before, there are groups of three box plots at each $k$, which from left to right represent $s_{mono}$, $s_{inc}$, and $s_{conv}$. The figure does not include plots for random testing with $s_{unif}$, because it reaches the crash with very low probability. Estimations produced by $s_{unif}$ are lower than 0.01 for all $k$. This demonstrates that random testing is insufficient in this case to reliably reach crashes of the system.
We further see that all learning-based approaches manage to generate near-optimal schedulers for all $k$. As before, both configurations of the incremental approach are more reliable than the monolithic approach. For this more complex system, the reliability gain from incremental scheduler generation is actually much larger than for the MQTT experiments. Early stopping affects probability estimations only marginally. This is also in line with previous observations. As for MQTT, we needed to set maxRounds to a value larger than initially planned for convergence to be detected. There is a large spread in the number of executed rounds, e.g., we executed between 42 and 240 rounds for $k = 14$. In this case, convergence was detected after 133.5 rounds on average. The average number of executed rounds is lower than 135 rounds for all $k$.

Gridworld
The following case study is inspired by a motion-planning scenario discussed by Fu and Topcu [23], also in the context of learning control strategies. In the experiments, we generate schedulers for a robot navigating in a gridworld environment. These schedulers shall with high probability reach a fixed goal location after starting from a fixed initial location.
A gridworld consists of tiles of different terrains and is surrounded by walls. To model obstacles, interior tiles may be walls as well. The robot starts at a predefined location and may move in one of four directions, i.e. we select from four inputs. It can observe changes in the type of terrain, whether it bumped into a wall, and whether it is located at the goal location. If the robot bumps into a wall, it will not change location. Whenever the robot moves, it may not reach its target, but rather reach a neighbouring tile with some probability, unless the neighbouring tile is a wall. That is, if the robot moves north, it may reach the tile situated north west or north east of its original position. The probability of such an error depends on the terrain of the target tile. We distinguish the terrains (with error probabilities in parentheses): Mud (0.4), Sand (0.25), Concrete (0), and Grass (0.2). As indicated above, Wall is actually also a terrain, one that cannot be entered. Figure 7a shows the gridworld we used for evaluation. Black tiles represent walls, while the other terrains are represented by different shades of grey and their initial letters. A circle marks the initial location and a double circle marks the goal location. Although its state space, containing 35 different states, is relatively small, navigating in this gridworld is challenging without prior knowledge. Initially, three moves to the right are necessary, as walls block direct moves towards the goal. This mimics the requirement of an initialisation routine.
Before discussing measurements, we want to briefly describe the structure of the MDP modelling this gridworld and how probabilities affect it. The initial state is labelled C and corresponds to the location with the coordinate (1, 1). Moving towards north is not possible, therefore the input north leads to a state labelled wall, which also corresponds to the coordinate (1, 1). If the robot instead moves towards east, it will reach a state corresponding to the coordinate (2, 1), which is labelled C. Moving from coordinate (3, 1) towards east, the target tile with the coordinate (4, 1) is labelled by M (Mud), which has a non-zero error probability of 0.4. Therefore, the input east causes a stochastic transition; with a probability of 0.6 the robot reaches the target coordinate (4, 1) and observes M, but with a probability of 0.4, it reaches the location (4, 2) to the south and observes C.
To infer schedulers, we applied the configuration given by Table 6, that is, maxRounds = 150, and n batch = 500, and p quit = 0.5. Due to the larger value of maxRounds, we increased c change as well to 0.975. This causes more random choices and thereby broad exploration in a larger number of rounds. As this case study differs significantly from the others, we chose maxRounds conservatively, performing a larger number of rounds.   Figure 7b shows measured estimations of P M,s (F <10 goal) for s inc , s conv , s mono , and random testing with s unif . The dashed line denotes the optimal probability.
Random testing obviously fails to reach the goal in less than ten steps. This is caused by the fact that it is unlikely to navigate past the walls via random exploration. The performance of the monolithic approach is also affected by this issue, because it learns solely from uniformly randomised sample traces. Random exploration covers only the initial part of the state space thoroughly. Therefore, the monolithically generated schedulers tend to perform worse than the incrementally generated ones. By directing exploration towards the goal, the incremental approach manages to generate near-optimal schedulers.
We also see that the impact of the convergence check is not severe. Both settings, with and without convergence check, produced similar results. The convergence check was able to reduce simulation costs for all but three runs of the experiment, in which convergence was not detected in less than 150 rounds. The incremental scheduler generation required at least 94 rounds and on average 131.9 rounds were executed before convergence was detected.

Shared coin consensus
The last case study examines scheduler generation for a randomised consensus protocol by Aspnes and Herlihy [6]. In particular, we used a model of the protocol distributed with the PRISM model checker [29] as a basis for this case study. Note that we did not change the functionality of the protocol, but only performed minor adaptations such as adding action labels for inputs.
This protocol's goal is to achieve consensus between multiple asynchronous processes, i.e. to find a preferred value of either 1 or 2 on which all processes agree. In the model distributed with PRISM, the processes share a global counter $c$ with a range of $[0 .. 2 \cdot (K + 1) \cdot N]$, where $N$ is the number of processes and $K$ is an integer constant. Initially, $c$ is set to $(K + 1) \cdot N$. All involved processes perform a sequence of steps to locally determine a preferred value $v$. Each of those actions, flipping a coin, checking it, and checking the value of $c$, represents one step in the protocol. Since the processes execute asynchronously, their actions may be arbitrarily interleaved, whereby the interleavings are controlled by schedulers. A scheduler may choose from $N$ inputs $go_i$, one for each process $i$. Performing $go_i$ basically instructs process $i$ to perform the next step in the protocol. If process $i$ already picked a preferred value in Step 3.1 or in Step 3.2, $go_i$ is simply ignored.
The visible outputs of the system are sets of propositions that hold in the current step. Firstly, the propositions expose the current value of the shared counter, i.e. they include (c = k) for some k ∈ [0 .. 2 · (K + 1) · N]. Secondly, they expose values of the local coins, i.e. the outputs include one (coin_i = x) for each process i, where x ∈ {heads, tails}. Additionally, the outputs may include a proposition finished, signalling that processes decided on a preferred value. As generating schedulers for this protocol in a learning-based fashion represents a demanding task, we only consider the case of K = 2 and N = 2, i.e. two asynchronously executing processes. Setting either of these constants to larger values significantly increases the number of steps to reach consensus.
Note that information about the current value of local coins is necessary to be able to generate optimal schedulers. To see this, consider the property φ = F^{<5} (c = 5), for which max_s P_{M,s}(φ) is equal to 0.75. Initially, we have c = 6 and an optimal scheduler may choose any action, say go_1. After that, we have coin_1 = heads with probability 0.5 and we should perform go_2, because go_1 would increment c. After performing go_2, we have coin_2 = heads with probability 0.5 and cannot satisfy φ anymore. All other traces would satisfy φ. Without knowledge about the state of local coins, we would not be able to make sensible choices of inputs. The randomised state machines controlling the processes remain a black box to us, though. Models of their composition are inferred via learning.
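The value 0.75 in the example above can be checked by enumerating the outcomes of the two fair coins. The following is only a minimal sketch of that calculation, not the protocol model itself. The function name `scheduler_satisfies_phi` is hypothetical, and the sketch assumes that a process whose coin shows tails decrements c in its next step, which the text implies but does not state.

```python
from fractions import Fraction
from itertools import product

COIN_SIDES = ("heads", "tails")


def scheduler_satisfies_phi(coin_1, coin_2):
    """Return True if the trace chosen by the scheduler reaches c = 5.

    Assumption (not spelled out in the text): a process whose coin shows
    tails decrements c in its next step, mirroring the stated increment
    on heads.
    """
    # Step 1: go_1 lets process 1 flip its coin.
    if coin_1 == "tails":
        # Step 2: go_1 again, process 1 decrements c from 6 to 5.
        return True
    # coin_1 is heads, so go_1 would increment c; switch to go_2 instead.
    if coin_2 == "tails":
        # Step 3: go_2 again, process 2 decrements c from 6 to 5.
        return True
    # Both coins show heads: phi can no longer be satisfied within the bound.
    return False


# Each pair of independent fair coin outcomes has probability 1/4.
max_probability = sum(
    Fraction(1, 4)
    for coin_1, coin_2 in product(COIN_SIDES, repeat=2)
    if scheduler_satisfies_phi(coin_1, coin_2)
)
print(max_probability)  # 3/4
```

Only the outcome in which both coins show heads fails, so the probability is 1 − 0.5 · 0.5 = 0.75. A scheduler that cannot observe the coins could not branch on `coin_1` as the sketch does.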
For the measurements, we optimise P_{M,s}(F^{<k} finished) for k ∈ {14, 20}, i.e. we try to find schedulings of the two processes which optimise the probability of finishing the protocol in less than 14 and 20 steps, respectively. Finishing here means that both processes picked a preferred value. In the experiments, we applied a configuration with maxRounds = 100, n_batch = 250, and p_quit = 0.025. Since we know that states outputting finished are absorbing, we stopped sampling upon seeing the finished output, as suggested in [35].
Figure 8 shows evaluation results for the monolithic and the incremental approach in comparison to random testing. The box plots corresponding to each of these are labelled s_mono, s_inc and s_unif, respectively. The dashed line represents the optimal probability as before. In contrast to previous experiments, we see that the monolithic approach may perform worse than random testing. For k = 14, there are three measurements that are exactly zero, but more than a quarter of the measurement results are near-optimal. For k = 20, the number of experiments achieving lower estimations than random testing decreases to two, but none of the generated schedulers is near-optimal. This can be explained by considering the minimum number of steps necessary to reach finished. We need to execute at least 12 steps to observe finished. As a result, it may happen that relevant parts of the system, namely states reached only after 12 steps, are inaccurately learned. This exemplifies that incremental scheduler generation pays off, because it is able to generate near-optimal schedulers for both values of k. For k = 14, three quarters of the incrementally generated schedulers are near-optimal and for k = 20, more than one quarter of the schedulers are near-optimal.
This case study highlights a weakness of our convergence check. It assumes that the search will converge to some unique behaviour. The protocol is completely symmetric for both processes, so it does not matter which process performs the first step. Hence, there are at least two optimal schedulers which differ in their initial action. This action is present in each of the 250 traces collected in one round, which presumably include further ambiguous choices. This causes similarSched to return false in most cases. Consequently, we do not discuss results obtained with the convergence check, as it rarely led to early stopping. A possible approach to counter this problem would be, assuming there is a lexicographic ordering on inputs, to always select the lexicographically minimal input whenever the choice is ambiguous.
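The proposed tie-breaking rule could be sketched as follows. This is an illustration under assumptions, not part of the evaluated implementation: the function `select_input`, the per-input value dictionary, and the `tolerance` parameter are hypothetical, and inputs are assumed to be strings ordered lexicographically.

```python
def select_input(values, tolerance=1e-9):
    """Pick an input with maximal value, breaking ties lexicographically.

    `values` maps each input name to the reachability probability that the
    scheduler expects after choosing it (hypothetical representation).
    Inputs whose value lies within `tolerance` of the maximum are treated
    as equally good, i.e. the choice between them is ambiguous.
    """
    best = max(values.values())
    candidates = [inp for inp, value in values.items() if best - value <= tolerance]
    # The lexicographically minimal candidate makes the choice deterministic.
    return min(candidates)


# Symmetric initial state: both inputs are equally good, so go1 is chosen.
print(select_input({"go2": 0.75, "go1": 0.75}))
```

With such a rule, two optimal schedulers that differ only in symmetric choices would make the same decision, so a comparison of scheduler behaviour between rounds would no longer be disturbed by these choices.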

Convergence check
We discussed the influence and application of early stopping throughout this section. Now, we want to briefly examine the underlying assumption. The rationale behind the convergence check and early stopping is that scheduler behaviour converges with an increasing number of rounds. As a result, fluctuations in the probability estimations produced by schedulers are expected to diminish. Ideally, estimations should increase over time as well, i.e. schedulers should improve. To investigate whether our assumptions hold, we applied the incremental approach, evaluated intermediate schedulers, and collected the produced probability estimations. Figure 9 contains graphs showing statistics summarising the collected estimations. The experiment summarised in Fig. 9a optimises reaching Pr10 in less than 14 steps with the slot machine. Figure 9b shows statistics for reaching goal in the gridworld in less than 10 steps. The graphs read as follows: the horizontal axis displays the rounds, and the vertical axis displays the value of the probability estimations. The lines from top to bottom represent the maximum, the third quartile, the median, the first quartile, and the minimum computed from the estimations collected in each round. As before, these values were computed from 20 runs.
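The five statistics plotted per round can be computed as in the following sketch. The function name `summarise_round` and the example data are hypothetical, and the choice of the "inclusive" quantile method is an assumption, since the text does not state how quartiles were computed.

```python
import statistics


def summarise_round(estimations):
    """Summarise the probability estimations collected in one round.

    `estimations` holds one estimation per run (20 runs in the text).
    The quartile method is an assumption made for this sketch.
    """
    first_quartile, median, third_quartile = statistics.quantiles(
        estimations, n=4, method="inclusive"
    )
    return {
        "max": max(estimations),
        "q3": third_quartile,
        "median": median,
        "q1": first_quartile,
        "min": min(estimations),
    }


# Made-up estimations from five runs, used only to show the call.
summary = summarise_round([0.1, 0.2, 0.3, 0.4, 0.5])
print(summary)
```

Applying such a function to every round yields the five lines of each graph, and the gap between `q3` and `q1` corresponds to the interquartile range.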
In both cases, we see that fluctuations decrease over time. The interquartile range decreases as well until it becomes relatively stable. Stable estimations are reached at around the 70th round in Fig. 9a, which is the area where convergence was detected; we stopped on average after 71.05 rounds. We see larger fluctuations of the minimal value in Fig. 9b, but they decline as well. Fluctuations of the minimal value can also be observed after 150 rounds. As a result, we may stop too early in rare cases. Figure 9b also reveals unexpected behaviour. Testing of the gridworld actually required relatively few rounds of learning to achieve good results. In particular, the estimations after the first rounds were larger than expected, because the basis for the first round of learning is formed by only n_batch random tests.

Runtime
We simulated previously learned models for the evaluation, thus the simulation cost was very low in our experiments. As a result, the time for learning and reachability checking dominated the computation time. Table 8 lists the average runtime of learning and reachability checking for the property with the largest bound of each case study. It can be seen that the incremental approaches, denoted by s_inc and s_conv, require considerably more time to complete. Incremental scheduler generation without convergence detection, for instance, takes on average 728.6 s for the TCP property F^{<17} crash, while the monolithic approach requires only 24.7 s. Thus, the better performance with respect to maximising probability estimations comes at the cost of increased runtime for learning and scheduler generation. In a testing scenario with real-world implementations, however, this time overhead may be negligible. If network communication is necessary, e.g., for protocol testing, the simulation time required for interacting with the SUT can be assumed to dominate the overall runtime. To contrast simulation runtime with the runtime of learning and scheduler generation, consider the hypothetical, but realistic scenario in which each simulation step takes about 10 milliseconds. Both approaches, the monolithic and the incremental, require approximately the same number of simulation steps, about 1.7 · 10^6 for F^{<17} crash. In this scenario, the simulation duration would amount to about 4.7 h, such that the runtime overhead of 703.9 s caused by the incremental approach would be low in comparison. Similar observations can be made for other case studies. Since the time spent simulating the SUT can be expected to dominate the computation time, we conclude that the incremental approach is preferable to the monolithic approach in this context.
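The figures in this scenario follow from simple arithmetic, which the short sketch below reproduces. All numbers are taken from the text; the variable names and the derived overhead ratio are only illustrative.

```python
# Numbers from the text for the TCP property with bound 17.
simulation_steps = 1.7e6          # approximate steps for both approaches
seconds_per_step = 0.010          # hypothetical cost of 10 ms per step
incremental_seconds = 728.6       # learning and scheduler generation, s_inc
monolithic_seconds = 24.7         # learning and scheduler generation, s_mono

# Total time spent interacting with the SUT in the hypothetical scenario.
simulation_seconds = simulation_steps * seconds_per_step
simulation_hours = simulation_seconds / 3600

# Additional learning time caused by the incremental approach.
overhead_seconds = incremental_seconds - monolithic_seconds
overhead_ratio = overhead_seconds / simulation_seconds

print(f"simulation: {simulation_hours:.1f} h")
print(f"overhead: {overhead_seconds:.1f} s ({overhead_ratio:.1%} of simulation time)")
```

Under these assumptions, the additional learning time of the incremental approach amounts to roughly 4 % of the time spent on simulation.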
In Table 8, we also see that the convergence detection provides a performance gain for the slot machine, but causes slightly worse runtime for MQTT and TCP. The decreased performance is caused by the fact that convergence often could not be detected within the maxRounds used for s_inc. Therefore, we increased maxRounds of s_conv for MQTT and TCP, as discussed above. However, the goal of convergence detection is not a reduction of runtime. With the convergence detection heuristic, we want to provide a stopping criterion that does not solely rely on an arbitrarily chosen maxRounds parameter.
Finally, we want to discuss the runtime complexity of learning and scheduler generation. The worst-case complexity of IOAlergia is cubic in the size of the merged tree-shaped representation of the sampled traces, but the typical behaviour observed in practice is approximately linear [36]. Hence, it is unlikely that learning runtime could be improved, but the scheduler generation runtime can potentially be improved. Our implementation communicates with PRISM via files, standard input and standard output. As a result, there is substantial communication overhead that could be removed via a tighter integration of scheduler generation. PRISM's default technique for scheduler generation, which is called value iteration, could also benefit algorithmically from such a tight integration. Since we check reachability with respect to a bound k, we could also bound the number of iterations performed by value iteration by k [7]. This leads to a worst-case runtime complexity of O(k · n^2 · m), where n is the number of states of a learned model and m is the number of inputs. The number of inputs m is generally a small constant and we have observed that n^2 is generally smaller than the number of sampling steps required for learning. As a result, we expect the simulation time to generally dominate in non-simulated scenarios.
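The bounded variant of value iteration mentioned above can be sketched in a few lines. This is a generic textbook-style illustration, not PRISM's implementation and not the code used in the experiments. The function `bounded_reachability`, the dictionary-based MDP encoding, and the toy model are hypothetical, and the sketch computes the maximal probability of reaching a goal state within at most k steps.

```python
def bounded_reachability(transitions, goal, k):
    """Maximal probability of reaching `goal` within at most k steps.

    `transitions[state][inp]` is a list of (probability, successor) pairs
    (hypothetical encoding of a learned MDP). Each of the k iterations
    visits n states, m inputs and up to n successors, which gives the
    O(k * n^2 * m) bound stated in the text.
    """
    values = {s: 1.0 if s in goal else 0.0 for s in transitions}
    for _ in range(k):
        updated = {}
        for state, by_input in transitions.items():
            if state in goal:
                updated[state] = 1.0  # goal states stay satisfied
                continue
            # Choose the input that maximises the expected successor value.
            updated[state] = max(
                sum(prob * values[succ] for prob, succ in successors)
                for successors in by_input.values()
            )
        values = updated
    return values


# Toy MDP with two inputs, used only to show the call.
toy_mdp = {
    "s0": {"a": [(0.5, "goal"), (0.5, "s1")], "b": [(1.0, "s1")]},
    "s1": {"a": [(1.0, "s1")], "b": [(0.5, "goal"), (0.5, "s0")]},
    "goal": {"a": [(1.0, "goal")], "b": [(1.0, "goal")]},
}
result = bounded_reachability(toy_mdp, {"goal"}, k=2)
print(result["s0"])  # 0.75
```

Because the loop runs exactly k times, no convergence threshold is needed. Recording the maximising input per state and remaining step count would additionally yield a scheduler.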

Discussion
We applied our approach to various types of models in several configurations, which we compared with each other, with the true optimal testing strategy, and with random testing as a baseline. The results of the performed experiments show (1) that learning-based approaches outperform the baseline and (2) that the incremental approach is able to generate near-optimal schedulers in all cases. In most experiments, the median probability estimation derived with the incremental approach was near-optimal, thus it generated near-optimal schedulers reliably. We have also seen that the convergence detection heuristic did not have a negative impact on the accuracy of the incremental approach. However, we are not able to give concrete bounds on the required number of samples to achieve a desired success rate. This is due to the fact that we rely on IOAlergia, for which convergence in the limit has been shown [36], but stronger guarantees are currently not available.
We generally targeted systems with small state spaces. An application in practice therefore requires abstraction to ensure that the state space is not prohibitively large. This is generally required in learning-based verification and several applications have shown that learning combined with abstraction provides effective means to enable model-based analysis [20,21,44,49].
In addition, we have seen limitations that cannot be solved by abstraction. The differences between estimations and optimal values tend to increase with the step bound k. This is potentially caused by the exponential growth of the number of different traces. This growth also affects the application of the approach to large gridworld examples. Increasing the width of the gridworld also increases the steps required to reach the goal and causes the performance to drop. A possible mitigation would be to identify disabled inputs, i.e. inputs rejected by the SUT, if we have such information. This might prevent certain traces from being executed beyond a disabled input. In the original version of IOAlergia [35], such knowledge facilitates learning, because disabled inputs are assumed to leave the current state unchanged. In the gridworld example, we may consider inputs to be disabled if they cause the robot to move into a wall.
A related issue also affects the case study on the consensus protocol. Changing the number of processes from two to four increases the minimum number of steps to reach finished to 24. Here, composition actually causes the state space to grow. We could tackle such problems via decomposition. Instead of learning a large monolithic model, several small models could be learned, which would then be composed for the reachability analysis.

Conclusion
We presented an approach to infer near-optimal schedulers for reachability objectives of MDPs. To our knowledge, it is the first such approach to be applicable in a purely black-box setting, where only the input interface is known. This is accomplished by incrementally refining the knowledge about the system via model inference and, based on that, property-directed exploration of the system. Section 5 presents promising results, showing that near-optimality can be achieved.
Therefore, we plan to investigate this approach further and extend it. As a first step, we are currently evaluating the method on more case studies. In order to be able to examine more complex systems, we are planning to work on compositional verification. That is, we are investigating how to benefit from decomposition, as opposed to treating composed systems as large monolithic systems. We are also studying the applicability in different testing scenarios. In a testing context, e.g., Nachmanson et al. [38] discussed strategies for bounded reachability games, but with a given model. Furthermore, non-functional properties like execution time could be considered. For that, we need to devise a model-inference technique which considers both non-deterministic choice of inputs and (continuous) time. Current approaches for probabilistic timed systems do not account for non-determinism of this form [36,52]. If we had such a technique, we could, e.g., use PRISM with the digital clocks engine [29] or Uppaal Stratego [15,16] to infer schedulers. Another possible extension would be to consider more general properties than reachability. This would require replacing PRISM's scheduler generation in our approach. In conclusion, we believe that our results are encouraging and that there are many promising directions for future research.