Quantitative Verification and Strategy Synthesis for Stochastic Games

Design and control of computer systems that operate in uncertain, competitive or adversarial environments can be facilitated by formal modelling and analysis. In this paper, we focus on the analysis of complex computer systems modelled as turn-based 2½-player games, or stochastic games for short, which are able to express both stochastic and non-stochastic uncertainty. We offer a systematic overview of the body of knowledge and algorithmic techniques for verification and strategy synthesis for stochastic games with respect to a broad class of quantitative properties expressible in temporal logic. These include probabilistic linear-time properties, expected total, discounted and average reward properties, and their branching-time extensions and multi-objective combinations. To demonstrate the applicability of the framework as well as its practical implementation in a tool called PRISM-games, we describe several case studies that rely on analysis of stochastic games, from areas such as robotics, and networked and distributed systems.


I. INTRODUCTION
Since the dawn of the information age, correctness and safety of computer systems have been central to their design and analysis. Computer systems typically operate in uncertain environments. The uncertainty can be stochastic due to, e.g., unreliable communication media, faulty components or simply the use of randomisation. Moreover, if components that cannot be controlled are present in the environment, their adversarial or competitive behaviour results in additional, non-stochastic uncertainty. Examples of such systems appear in many domains, from robotics and autonomous transport, to security, networked and distributed systems, and power management.
It is natural to view such complex systems as games between the controllable computer system and its (uncontrollable) environment. In this work, we present a comprehensive overview of techniques used in verification and controller (also called strategy) synthesis for systems modelled as 2½-player games, or stochastic games for short. In every step of a stochastic game, the two players, Player 1 and Player 2, choose their moves and, based on their choices, the next state of the game is determined, possibly in a probabilistic fashion. Controller synthesis can then be viewed as finding a winning strategy for Player 1, where Player 2 may play adversarially. Stochastic games have been employed, for example, to support decision making and synthesise controllers for aircraft power distribution [1], sensor network management in renewable energy production plants [2], human-in-the-loop UAV planning [3], and autonomous driving in the presence of hazards such as pedestrians [4], [5]. They arise naturally in the context of security and defence, where they have been used in patrol planning [6], port defence [7] and infrastructure protection [8], to generate countermeasures for DNS bandwidth attacks [9], and to analyse complex attack-defence scenarios in an RFID goods management system [10]. Through the use of abstraction and discretisation, high-level control of hybrid and continuous systems can also be addressed.

The authors are with the Department of Computer Science, University of Oxford, Oxford, UK, firstname.lastname@cs.ox.ac.uk. This work was supported by ERC Advanced Investigators Grant VERIWARE and EPSRC Mobile Autonomy Programme Grant EP/M019918/1.
Stochastic games were first introduced by Shapley in 1953 [11]. Various classes and modifications of these games have been extensively studied since then and surveyed in, e.g., [12], [13], [14], [15], [16], [17], [18]. In this survey, we focus on turn-based games, where players choose their moves in turns rather than concurrently as in [11], [12], [13]. More specifically, we restrict our attention to turn-based, finite, complete-observation, stochastic, discrete-time, zero-sum games. We also consider a generalisation of these games to multiple players. Compared to existing surveys of these games such as [14], the distinguishing feature of our survey is a comprehensive coverage of algorithms for temporal logic properties, including reward and multi-objective properties not covered in [14], and an illustration of their practical application on a tutorial-style example. Other surveys typically focus on related classes of games, such as concurrent games [11], [12], [13], or only on a subclass of properties, e.g., single-objective properties [14].
We study various classes of properties of stochastic games expressible in temporal logic. First, we consider quantitative probabilistic properties over linear time, expressed as formulas of probabilistic linear temporal logic. Examples of such properties include 'the maximum probability of the airbag failing to deploy within 0.02 seconds is at most $10^{-6}$', or 'the minimum probability of the car reaching its destination without colliding with pedestrians, while obeying traffic rules, is at least $1 - 10^{-10}$'. Next, properties reasoning about rewards associated with states of the game are introduced. Namely, we consider the expected total and discounted cumulative reward as well as the long-run average reward. These can be used to state properties such as 'the minimum expected profit that the investor can guarantee within a year is at least 1000', or 'the expected number of requests served per time unit in the network is at least 5'. Finally, we allow the above linear-time and reward properties to be combined to express requirements over branching time, thus enabling analysis of properties such as 'the probability that the network recovers from a bad decision to a state from which a consensus can be reached with probability at least 0.9 is at least 0.95'.
Given a stochastic game and a property, the verification and strategy synthesis problems, respectively, focus on the existence and construction of a strategy for Player 1 that guarantees satisfaction of the property against all strategies of Player 2. In this work, we discuss general findings for the two problems and overview the existing algorithmic solutions for various classes of properties. The solutions typically rely on a reduction to simpler games or properties, and the computation of optimal values and strategies, for which a value iteration algorithm is typically utilised. In the multi-player case, a coalition of players aims to cooperatively enforce a property. Intuitively, multi-player stochastic games can be seen as stochastic games with two players, where the coalition acts as Player 1 and the remaining set of players as Player 2. Finally, we analyse stochastic games with respect to multi-objective properties that require simultaneous satisfaction of multiple linear-time and reward properties. Here, the properties can be conflicting and the techniques reduce to the computation of an ε-approximation of the Pareto set of optimal trade-offs between the individual properties.

While a number of software tools exist with partial support for stochastic games, see Sec. V for a summary, they only allow a subclass of such games, e.g., with one player or without stochasticity, or perform analysis of stochastic games against single-objective properties only. On the other hand, most of the overviewed algorithms have been implemented within the tool called PRISM-games [19], [20] for modelling, verification, synthesis and simulation of stochastic games, an extension of the PRISM model checker [21]. We briefly overview the features and functionality of the tool and offer a number of case studies, where complex computer systems have been modelled as stochastic games and their properties analysed in PRISM-games.
Contributions of this paper can be summarised as follows:
• we present a comprehensive framework for analysis of stochastic games, focusing on high-level temporal logic specifications;
• we overview the existing body of knowledge and algorithmic solutions for verification and strategy synthesis problems for stochastic games, and identify open problems;
• we offer a list of case studies of control systems that are modelled and analysed through stochastic games.
The remainder of this paper is organised as follows. In Sec. II, we introduce stochastic games and define a specification language for linear-time and reward properties. In Sec. III, we formulate the verification and strategy synthesis problems for single-objective properties, present general findings for the problems, as well as algorithmic solutions, and their extensions to branching-time properties and multi-player games. Multi-objective combinations of properties are then discussed in Sec. IV. We overview existing tools for games in Sec. V and briefly describe the functionality of PRISM-games, which currently provides the most comprehensive support for stochastic games. Finally, we list several case studies that rely on stochastic games in Sec. VI. We finish with concluding remarks in Sec. VII. To demonstrate the framework, an illustrative example modelled and analysed in PRISM-games is used throughout the paper.

A. Notation
We use $D(X)$ to denote the set of all probability distributions over a set $X$. Given a finite or infinite sequence $\lambda$ of elements of $X$, we use $\lambda_i$ to denote its $i$-th element for $i \geq 0$. For a finite sequence $\lambda = x_0 x_1 \ldots x_k$ of elements of $X$, we use $|\lambda| = k + 1$ to denote the length of the sequence and $\mathit{last}(\lambda) = x_k$ to denote its last element.

B. Stochastic Games
Definition 1 (Stochastic game): A turn-based 2½-player game, or simply a stochastic game, is a tuple $G = (S, (S_1, S_2, S_p), \Delta)$, where $S$ is a finite set of states partitioned into sets $S_1$, $S_2$ and $S_p$ of Player 1, Player 2 and probabilistic states, respectively, and $\Delta : S \times S \to [0, 1]$ is a probabilistic transition function such that, for states $s \in S_1 \cup S_2$, it holds that $\Delta(s, s') \in \{0, 1\}$ for every $s' \in S$, where we assume that $\Delta(s, s') = 1$ for at least one $s' \in S$, and for states $s \in S_p$, it holds that $\sum_{s' \in S} \Delta(s, s') = 1$.
Intuitively, the game is played as follows. The state of the game is always determined uniquely and, in every step, the next state is chosen according to the transition function. When the current state of the game is a Player 1 state, i.e., $s \in S_1$, then Player 1 chooses the next state $s' \in S$ such that $\Delta(s, s') = 1$, and similarly for Player 2. When the current state is probabilistic, i.e., $s \in S_p$, the next state of the game is sampled according to the distribution $\Delta(s, \cdot)$.
Formally, a path of a game $G$ is an infinite sequence $\lambda = s_0 s_1 \ldots$ such that $\Delta(s_i, s_{i+1}) > 0$ for all $i \geq 0$. A finite path of $G$ is a finite prefix of a path. We use $\mathit{Path}_{G,s}$ to denote the set of all paths originating in a state $s \in S$ and $\mathit{Path}_G = \bigcup_{s \in S} \mathit{Path}_{G,s}$. The sets $\mathit{FPath}_{G,s}$, $\mathit{FPath}_G$ of finite paths are defined analogously. Given two states $s, s' \in S$, we say that $s'$ is reachable from $s$ if and only if there exists a finite path $\lambda$ such that $\lambda_0 = s$ and $\mathit{last}(\lambda) = s'$.
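To make Def. 1 and the notion of reachability concrete, the following is a minimal Python sketch of one possible encoding of a stochastic game. The class name `StochasticGame`, the dictionary-based representation of $\Delta$ and the toy game are illustrative assumptions and are not taken from PRISM-games.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class StochasticGame:
    """Turn-based stochastic game G = (S, (S1, S2, Sp), Delta), cf. Def. 1."""
    s1: Set[str]                         # Player 1 states
    s2: Set[str]                         # Player 2 states
    sp: Set[str]                         # probabilistic states
    delta: Dict[str, Dict[str, float]]   # delta[s][t] = Delta(s, t); absent entries are 0

    @property
    def states(self) -> Set[str]:
        return self.s1 | self.s2 | self.sp

    def validate(self, tol: float = 1e-9) -> None:
        """Raise ValueError if the conditions of Def. 1 are violated."""
        if self.s1 & self.s2 or self.s1 & self.sp or self.s2 & self.sp:
            raise ValueError("S1, S2 and Sp must be pairwise disjoint")
        for s in self.states:
            succ = self.delta.get(s, {})
            if not set(succ) <= self.states:
                raise ValueError(f"unknown successor of {s}")
            probs = list(succ.values())
            if s in self.sp:
                # Probabilistic state: Delta(s, .) must be a distribution.
                if any(p < 0 or p > 1 for p in probs) or abs(sum(probs) - 1.0) > tol:
                    raise ValueError(f"Delta({s}, .) is not a distribution")
            else:
                # Player state: entries are 0 or 1, with at least one 1.
                if any(p not in (0, 1) for p in probs) or 1 not in probs:
                    raise ValueError(f"player state {s} needs a 0/1 row with a successor")

    def reachable(self, s: str) -> Set[str]:
        """States t for which a finite path from s to t exists (s itself included)."""
        seen, stack = {s}, [s]
        while stack:
            u = stack.pop()
            for t, p in self.delta.get(u, {}).items():
                if p > 0 and t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen


# Hypothetical toy game: "a" is a Player 1 state, "b" a Player 2 state,
# "c" and "done" are probabilistic states, "done" being absorbing.
toy = StochasticGame(
    s1={"a"}, s2={"b"}, sp={"c", "done"},
    delta={
        "a": {"b": 1, "c": 1},
        "b": {"a": 1, "done": 1},
        "c": {"a": 0.5, "done": 0.5},
        "done": {"done": 1.0},
    },
)
toy.validate()
```

Storing $\Delta$ as a sparse successor map keeps the path condition $\Delta(s_i, s_{i+1}) > 0$ a simple dictionary lookup; a dense matrix would serve equally well for small games.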
Definition 2 (Labelling function): Given a finite set of atomic propositions $AP$, a labelling function $L : S \to 2^{AP}$ assigns to each state $s \in S$ of the game a set of atomic propositions that hold true in $s$.
While the term reward intuitively suggests that the goal will be to maximise functions over these values, we use a reward structure as a general value assignment and consider minimisation problems as well. In such a case, the values are often referred to as costs rather than rewards. Negating all rewards, i.e., inverting their sign, again yields a reward structure, and this will allow us to translate minimisation problems to maximisation. Note that, unless stated otherwise, in this work we do not consider reward structures that assign both negative and positive values.
Stochastic games as defined above were first studied with respect to reward properties in [22], [23], as a special case of games originally defined by Shapley [11]. With respect to reachability properties, simple stochastic games were studied in [24]. In a simple stochastic game, every state has exactly two successors and all transitions from probabilistic states have probability 0.5, simulating a coin toss. More complex, temporal properties of stochastic games were then introduced in [25].
Our definition of a stochastic game (Def. 1) is based on the definition used, e.g., in [24], [14]. An alternative way to define such games is to partition the state space only into Player 1 and Player 2 states and introduce a finite set of actions. The transition function then defines at most one probability distribution for each pair of a state and an action. A reward structure assigns values to pairs of states and actions. Note that this definition of games, appearing, e.g., in [11], [22], is equivalent to Def. 1.
Stochastic games include as a subclass many interesting and widely studied models. If there are no probabilistic states, i.e., $S_p = \emptyset$, the game is called a 2-player game or a non-stochastic game. Similarly, if the game only has one player and probabilistic states, i.e., $S_1 = \emptyset$ or $S_2 = \emptyset$, it is called a 1½-player game or a Markov decision process (MDP). Moreover, if also $S_p = \emptyset$, the game is a transition system. Finally, if both $S_1 = \emptyset$ and $S_2 = \emptyset$, the game reduces to a Markov chain.
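The case distinction above can be summarised in a few lines of Python. This is only an illustrative sketch; the function name `classify` and the returned strings are assumptions made for the example.

```python
def classify(s1, s2, sp):
    """Name the subclass of stochastic games determined by which state sets are empty."""
    if not (s1 or s2 or sp):
        raise ValueError("a game needs at least one state")
    if not s1 and not s2:
        return "Markov chain"                 # only probabilistic states
    one_player = not s1 or not s2
    if not sp:
        # no probabilistic states: non-stochastic models
        return "transition system" if one_player else "2-player game"
    return "Markov decision process" if one_player else "stochastic game"


kinds = [
    classify({"a"}, {"b"}, {"c"}),
    classify({"a"}, {"b"}, set()),
    classify({"a"}, set(), {"c"}),
    classify({"a"}, set(), set()),
    classify(set(), set(), {"c"}),
]
```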
Definition 4 (Strategy): A Player 1 strategy is a tuple $\pi = (M, \pi_u, \pi_n, \pi_{init})$, where $M$ is a countable set of memory elements, $\pi_u : M \times S \to M$ is a memory update function, $\pi_n : M \times S_1 \to D(S)$ is a next move function such that $\pi_n(m, s)(s') > 0$ only if $\Delta(s, s') > 0$, and $\pi_{init} : S \to M$ is an initial memory element function. A Player 2 strategy $\sigma = (M, \sigma_u, \sigma_n, \sigma_{init})$ is defined analogously.
Intuitively, strategies prescribe the behaviour of players as follows. Given a Player 1 strategy $\pi$, first, an initial memory element is chosen according to the function $\pi_{init}$. Then, in every step of the game, Player 1 updates the current memory element based on the current state of the game, using the memory update function $\pi_u$. Moreover, if the game is in a Player 1 state, Player 1 chooses the next state of the game using the next move function $\pi_n$. Player 2 strategies are applied in an analogous way.
We use $\Pi$ and $\Sigma$ to denote the set of all Player 1 and Player 2 strategies, respectively. A (finite) path under strategies $\pi \in \Pi$, $\sigma \in \Sigma$ is any (finite) path resulting from Player 1 playing according to strategy $\pi$ and Player 2 playing according to strategy $\sigma$. We use the notation $\mathit{Path}^{\pi,\sigma}_{G}$, $\mathit{Path}^{\pi,\sigma}_{G,s}$, $\mathit{FPath}^{\pi,\sigma}_{G}$ and $\mathit{FPath}^{\pi,\sigma}_{G,s}$ with the obvious meaning. Generally, a Player 1 strategy $\pi$ is randomised. It is called pure if the next move function $\pi_n$ is of type $\pi_n : M \times S_1 \to S$. Similarly, $\pi$ is a memoryless strategy if $M$ is a singleton, a finite-memory strategy if $M$ is finite, and an infinite-memory strategy in the general case. For simplicity, we consider pure memoryless strategies to be functions of type $\pi : S_1 \to S$. Player 2 strategies are classified in the same way.
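As an illustration of Def. 4, the sketch below encodes a strategy as a triple of Python functions and samples a finite path under a pair of strategies. It is a hedged example rather than a reference implementation: the names `Strategy` and `simulate`, the toy game, and the convention that memory is updated with the state just entered are assumptions made here.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Hashable

Dist = Dict[str, float]  # a distribution over successor states


@dataclass
class Strategy:
    init: Callable[[str], Hashable]               # pi_init : S -> M
    update: Callable[[Hashable, str], Hashable]   # pi_u : M x S -> M
    next_move: Callable[[Hashable, str], Dist]    # pi_n : M x S_i -> D(S)


def simulate(delta, s1, s2, pi, sigma, s0, steps, rng):
    """Sample a finite path with `steps` transitions under strategies pi and sigma."""
    path = [s0]
    m1, m2 = pi.init(s0), sigma.init(s0)
    for _ in range(steps):
        s = path[-1]
        if s in s1:
            dist = pi.next_move(m1, s)
        elif s in s2:
            dist = sigma.next_move(m2, s)
        else:
            dist = delta[s]                       # probabilistic state
        succ = list(dist)
        nxt = rng.choices(succ, weights=[dist[t] for t in succ])[0]
        if delta[s].get(nxt, 0) <= 0:
            raise ValueError("strategy chose a move that is not a transition")
        path.append(nxt)
        # Assumed convention: both players update memory with the state just entered.
        m1, m2 = pi.update(m1, nxt), sigma.update(m2, nxt)
    return path


# Hypothetical toy game: "a" belongs to Player 1, "b" to Player 2, "c" is probabilistic.
delta = {"a": {"b": 1, "c": 1}, "b": {"a": 1}, "c": {"a": 0.5, "c": 0.5}}
s1, s2 = {"a"}, {"b"}

# Pure memoryless strategies: M is a singleton, here {None}.
to_b = Strategy(lambda s: None, lambda m, s: None, lambda m, s: {"b": 1.0})
to_a = Strategy(lambda s: None, lambda m, s: None, lambda m, s: {"a": 1.0})

# Pure finite-memory strategy with M = {0, 1}: the bit flips whenever the game
# moves to "a", and the move taken from "a" alternates between "b" and "c".
alternate = Strategy(
    init=lambda s: 0,
    update=lambda m, s: 1 - m if s == "a" else m,
    next_move=lambda m, s: {"b": 1.0} if m == 0 else {"c": 1.0},
)

rng = random.Random(0)
path_memoryless = simulate(delta, s1, s2, to_b, to_a, "a", 4, rng)
path_memory = simulate(delta, s1, s2, alternate, to_a, "a", 3, rng)
```

Both sampled prefixes are deterministic here because they never leave a probabilistic state: the memoryless pair cycles between "a" and "b", while the finite-memory strategy switches to "c" on its second visit to "a".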
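Let $s \in S$ be a state of a game $G$ and let $\pi \in \Pi$, $\sigma \in \Sigma$ be a Player 1 and a Player 2 strategy, respectively. Given a finite path $\lambda \in \mathit{FPath}^{\pi,\sigma}_{G,s}$, the cylinder set $\mathit{Cyl}(\lambda)$ is the set of all paths in $\mathit{Path}^{\pi,\sigma}_{G,s}$ that have $\lambda$ as a prefix. Consider the σ-algebra Σ of paths (distinct from the set of Player 2 strategies) generated by the set of all such cylinder sets. According to classical probability and measure theory [26], there exists a unique probability measure $\Pr^{\pi,\sigma}_{G,s}$ over Σ such that, for all $\lambda \in \mathit{FPath}^{\pi,\sigma}_{G,s}$, the value $\Pr^{\pi,\sigma}_{G,s}(\mathit{Cyl}(\lambda))$ is the probability of the finite path $\lambda$ under $\pi$ and $\sigma$, i.e., the product of the probabilities of its individual transitions, given by $\Delta$ in probabilistic states and by the next move functions of $\pi$ and $\sigma$ in Player 1 and Player 2 states. Given a random variable $\rho$ over the probability space $(\mathit{Path}^{\pi,\sigma}_{G,s}, \Sigma, \Pr^{\pi,\sigma}_{G,s})$, the expected value of $\rho$ is defined as $\mathbb{E}^{\pi,\sigma}_{G,s}[\rho] = \int_{\mathit{Path}^{\pi,\sigma}_{G,s}} \rho \, d\Pr^{\pi,\sigma}_{G,s}$.

Definition 5 (Stopping game): A game is called stopping if it has at least one terminal state and if it holds that, for every pair of strategies $\pi \in \Pi$, $\sigma \in \Sigma$ and every initial state, with probability 1 the game eventually stops, i.e., a terminal state is reached.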
The principle of stopping games was introduced in [11], where games were first studied with respect to reward properties, to avoid infinite accumulation of rewards. The original definition imposed a stronger assumption, not required for the results discussed here, that in every step the game stops with non-zero probability. The more general notion of stopping as in Def. 5 appears, for example, in [24].
Note that we do not consider partial observability and assume the games are finite. Several related, more general stochastic game models exist, which include concurrent games [11], [12], [13], partial-observation games [17] and uncertain (or bounded-parameter) MDPs [18]. In particular, uncertain MDPs generalise stochastic games to infinite games by considering Player 2 with an infinite set of actions, typically defined through convex uncertainty sets. The motivation comes from the fact that, in many practical problems, estimating transition probabilities of systems with stochastic uncertainty from data may be very difficult, but they can be over-approximated using sets. These models have a strong connection to partial-observation and concurrent games, see, e.g., [27], [28], [29].
Example 1: Consider the stochastic game $G = (S, (S_1, S_2, S_p), \Delta)$ depicted in Fig. 1(a). It can be seen as a simplified version of the autonomous driving case study presented in [5]. The game models a car driving in an urban area that is given a route to navigate through. While navigating the route, it has to autonomously react to hazards. We consider two hazards, a traffic jam and a pedestrian. In order to avoid a hazard, the car chooses to perform one of three reactions, namely, change lane, honk or brake. The game proceeds as follows. Starting from the probabilistic state $s_0$, a hazard is encountered or the car successfully finishes its route by entering the terminal state labelled with atomic proposition succ, shown in green in Fig. 1(a). In the former case, each of the two hazards appears with probability 0.3 and, with probability 0.2, the choice of a hazard is left to Player 2. The outcome of the three reactions to each hazard is indicated in Fig. 1. We consider two reward structures, $r_{energy}$ and $r_{time}$, over $G$ defined in Fig. 1(b) that represent the energy and time demands of individual reactions to hazards, respectively.
We modelled the game in the tool PRISM-games and analysed it with respect to several properties. The corresponding input files for PRISM-games can be found in [30], and we report on the results in the following sections.

C. Properties
In this work, we are interested in both temporal and reward properties of stochastic games. The specification language defined below allows us to formulate such properties over linear time, namely probabilistic linear-time properties, and various expected reward functions. The language is motivated by the language used by the PRISM model checker [21] and its extension for stochastic games known as PRISM-games [19], [20], based on probabilistic Computation Tree Logic (PCTL) and its extension PCTL* [31], which combines Linear Temporal Logic (LTL) with the probabilistic and reward operators in a CTL-like branching-time fashion.
The semantics of the properties is listed in Table I. In particular, formulas $\varphi$, i.e., the probabilistic and reward operators, are interpreted over states of the game, and linear temporal formulas $\psi$ and reward functions are interpreted over paths of the game. Below, we give a brief description, grouping properties from Def. 6 into categories according to their syntactic form.
Probabilistic reachability: Properties of the form $\mathsf{P}_{\bowtie p}[\mathsf{F}\,a]$, where $\mathsf{F}\,a = \mathsf{true}\;\mathsf{U}\;a$ and $\bowtie$ stands for the comparison operator of the bound, e.g., $\geq$, are called probabilistic reachability properties. Given a state $s$, the property requires that there exists a Player 1 strategy $\pi \in \Pi$ such that, for all Player 2 strategies $\sigma \in \Sigma$, the probability of reaching a state labelled with atomic proposition $a$, starting from $s$, under the two strategies satisfies the bound $\bowtie p$. Probabilistic reachability properties represent a simple but fundamental class of properties since many properties of games can be reduced to probabilistic reachability. A step-bounded probabilistic reachability property is a property of the form $\mathsf{P}_{\bowtie p}[\mathsf{F}^{\leq k} a]$, where $k \in \mathbb{N}_0$, $\mathsf{F}^{\leq k} a = \bigvee_{i=0}^{k} \mathsf{X}^i a$ and $\mathsf{X}^i$ is an abbreviation for a sequence of $i$ consecutive instances of the operator $\mathsf{X}$. In this case, the goal is to visit a state labelled with atomic proposition $a$ within the first $k$ steps of a path. Finally, properties of the form $\mathsf{P}_{\geq 1}[\mathsf{F}\,a]$ are called almost-sure reachability properties.
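As a small worked instance of the step-bounded abbreviation, unfolding it for $k = 2$, with $\mathsf{X}^0 a$ read as $a$ itself, gives:

```latex
\mathsf{F}^{\leq 2} a
  \;=\; \bigvee_{i=0}^{2} \mathsf{X}^{i} a
  \;=\; a \;\vee\; \mathsf{X}\, a \;\vee\; \mathsf{X}\mathsf{X}\, a
```

That is, a path satisfies the formula if and only if one of its first three states is labelled with $a$.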
Probabilistic LTL: Properties of the form $\mathsf{P}_{\bowtie p}[\psi]$ are generally referred to as probabilistic LTL properties since, according to Def. 6, $\psi$ can be an arbitrary LTL formula [32]. Recently, LTL has been used increasingly often in various areas of control as it is expressive enough to describe many interesting properties of systems and, at the same time, it resembles natural language statements. Examples of properties that can be expressed in LTL include reachability $\mathsf{F}\,a$, safety $\mathsf{G}\,a = \neg\mathsf{F}\,\neg a$, liveness $\mathsf{G}(a \Rightarrow \mathsf{F}\,b)$, persistent surveillance $\mathsf{G}\mathsf{F}\,a$ or stability $\mathsf{F}\mathsf{G}\,a$. Similarly to probabilistic reachability, almost-sure LTL properties are properties of the form $\mathsf{P}_{\geq 1}[\psi]$.

Total reward properties: Properties of the form $\mathsf{R}^r_{\bowtie x}[\rho]$, where $\rho$ is equal to $\mathsf{C}$, $\mathsf{C}^{\leq k}$ or $\mathsf{F}^{\star} a$, are called total reward properties. They are concerned with the expected cumulative reward collected in states of the game over an infinite time horizon ($\mathsf{C}$), in the first $k$ steps of a path for $k \in \mathbb{N}_0$ ($\mathsf{C}^{\leq k}$), or until a state labelled with proposition $a$ is reached ($\mathsf{F}^{\star} a$). In the last case, if such a state is never visited, the cumulative reward can be treated in different ways through the use of the flag $\star$. Namely, we consider the reward being zero ($\star = 0$), infinity ($\star = \infty$), or we allow the reward to accumulate indefinitely ($\star = c$). Just as reachability properties are fundamental probabilistic properties, total reward properties are fundamental reward properties.
Discounted reward properties: Properties of the form $\mathsf{R}^r_{\bowtie x}[\mathsf{D}^{\beta}]$ analyse the expected cumulative reward over an infinite time horizon, where the collected rewards are increasingly discounted by the discount factor $\beta \in (0, 1)$.
Average reward properties: Finally, properties of the form $\mathsf{R}^r_{\bowtie x}[\mathsf{S}]$ analyse the expected average reward collected in states of the game over an infinite time horizon.

III. SINGLE-OBJECTIVE GAME SOLVING
In this section, we first formulate the problem of solving a stochastic game with respect to a property and discuss general findings for this problem. Next, we overview the algorithmic solutions for different types of properties. Finally, we discuss extensions of these techniques to properties over branching time and multi-player stochastic games.

A. Problem Formulation
A Player 1 strategy that is a solution to Problem 2 is called a winning Player 1 strategy. Conversely, Player 2 aims to violate the property $\varphi$ and a winning Player 2 strategy is such that, for all Player 1 strategies, the property is not satisfied. Games with this semantics are called zero-sum games since the objectives of the two players are complementary.
In order to solve both the verification and strategy synthesis problems, we consider the optimal values of path formulas $\psi$ and reward functions $\rho$ from Def. 6. For path formulas, they are defined as

$$\Pr^{\min}_{G,s}(\psi) = \inf_{\pi \in \Pi} \sup_{\sigma \in \Sigma} \Pr^{\pi,\sigma}_{G,s}(\psi), \qquad \Pr^{\max}_{G,s}(\psi) = \sup_{\pi \in \Pi} \inf_{\sigma \in \Sigma} \Pr^{\pi,\sigma}_{G,s}(\psi), \qquad (1)$$

and the optimal expected values of reward functions $\rho$ are defined analogously, with expected values in place of probabilities. A Player 1 strategy $\pi \in \Pi$ starting from state $s$ is called optimal if it achieves the optimal value, e.g., $\sup_{\sigma \in \Sigma} \Pr^{\pi,\sigma}_{G,s}(\psi) = \Pr^{\min}_{G,s}(\psi)$. Similarly, the strategy is called ε-optimal, for $\varepsilon > 0$, if it achieves a value deviating by at most $\varepsilon$ from the optimum, e.g., $\sup_{\sigma \in \Sigma} \Pr^{\pi,\sigma}_{G,s}(\psi) \leq \Pr^{\min}_{G,s}(\psi) + \varepsilon$.

A stochastic game is called determined with respect to a chosen optimality criterion if the order in which the two players optimise can be exchanged, e.g., $\sup_{\pi \in \Pi} \inf_{\sigma \in \Sigma} \Pr^{\pi,\sigma}_{G,s}(\psi) = \inf_{\sigma \in \Sigma} \sup_{\pi \in \Pi} \Pr^{\pi,\sigma}_{G,s}(\psi)$. Determinacy also guarantees the existence of ε-optimal strategies for all $\varepsilon > 0$ for both players from every state. A deep result in [33] established determinacy for a large class of games including stochastic games with respect to any Borel measurable property and, in particular, with respect to all properties in Def. 6. Note that determinacy does not necessarily imply the existence of optimal strategies. However, for all classes of properties described in Sec. II-C it has been shown that both players have optimal strategies and pure memoryless strategies suffice, except for the step-bounded properties, the class of general probabilistic LTL properties and total reward properties with $\rho = \mathsf{F}^{0} a$, where pure finite-memory strategies may be required, see [34], [14], [35] and references therein.

The optimal values and strategies can be used to solve the verification problem stated in Problem 1 in the following way. For example, to solve the verification problem for $G$, $s$ and property $\mathsf{P}_{\geq p}[\psi]$, it suffices to verify that $\Pr^{\max}_{G,s}(\psi) \geq p$. The remaining properties in Def. 6 can be addressed in an analogous way. To solve the strategy synthesis problem stated in Problem 2, we compute an optimal or a suitable ε-optimal Player 1 strategy. Together with the existence of optimal strategies, determinacy implies that, for every property in Def. 6, there exists a winning strategy for one of the players.
To directly address the computation of optimal values and strategies, we extend the syntax of properties to include numerical queries for computing optimal strategies.

Definition 7 (Numerical query): Let $\psi$, $r$ and $\rho$ be as defined in Def. 6. Numerical queries $\mathsf{P}_{\min=?}[\psi]$, $\mathsf{P}_{\max=?}[\psi]$, $\mathsf{R}^r_{\min=?}[\rho]$ and $\mathsf{R}^r_{\max=?}[\rho]$ aim to compute the optimal values defined in Eq. 1, respectively, together with the corresponding optimal Player 1 strategies.
As discussed above, the problem of computing the optimal values for states, called quantitative query solving, and that of constructing an optimal Player 1 strategy, called strategic query solving, are two separate problems utilised to solve the verification and strategy synthesis problems for games. In [34], it has been shown that all problems of quantitative and strategic query solving for probabilistic reachability, total, discounted as well as average reward properties are polynomially equivalent, i.e., there exists a polynomial-time reduction between the problems. These problems are in the complexity class NP ∩ coNP. No polynomial-time algorithm is known, even for reachability objectives, and the widely used exponential-time algorithm for the corresponding quantitative query solving problems, i.e., computing the optimal values for states, is a value iteration algorithm, presented in detail later in this section. It is not necessarily true that the strategic solution can be easily derived from the quantitative solution, i.e., an optimal strategy might not be easily constructed from the optimal values. For probabilistic reachability, total and discounted reward properties this is nevertheless the case, and optimal strategies can be constructed from the optimal values in linear time. For average reward properties, the existence of a similar (even polynomial-time) algorithm remains an open question [34].
In Sec. III-B to III-F below, we discuss algorithmic solutions to the verification and strategy synthesis problems for properties in Def. 6 as classified in Sec. II-C. Firstly, note that, by determinacy, minimisation queries can be reduced to maximisation queries, and hence it suffices to discuss maximisation numerical queries. In Sec. III-G, we discuss a generalisation of these techniques to a logic that combines properties from Def. 6 in a PCTL*-like fashion to obtain properties over branching time. Finally, Sec. III-H overviews an extension of verification and strategy synthesis for stochastic games with two players to general, multi-player stochastic games.

B. Probabilistic Reachability
Consider the numerical reachability query $\mathsf{P}_{\max=?}[\mathsf{F}\,a]$ for a game $G$ with an initial state $s \in S$. To quantitatively solve the query, we can use an adaptation of the value iteration algorithm that first appeared in [24] for simple stochastic games defined in Sec. II-B. The algorithm computes the optimal values for Player 1 states $s \in S_1$ as $v^*(s) = \lim_{n \to \infty} v^*_n(s)$, where $v^*_n(s)$ is iteratively computed as indicated in Fig. 2. While the limit may not be reached in a finite number of iterations, a precision threshold $\alpha$ can be computed such that, if the value iteration algorithm is terminated once the maximum difference between $v^*_n(s)$ and $v^*_{n+1}(s)$, for $s \in S$, is not more than $\alpha$, the limit values can be obtained by simple rounding [36]. Moreover, using this procedure, the algorithm always stops in a number of iterations that is at most exponential in the size of the game.

It was proven in [34] that, given the quantitative solution to the query, the necessary and sufficient conditions for the strategic solution, i.e., a pure memoryless optimal Player 1 strategy $\pi^*$, are the following. First, an optimal strategy $\pi^*$ satisfies, for every $s \in S_1$, the set membership $\pi^*(s) \in \arg\max_{s' \in S,\, \Delta(s, s') = 1} v^*(s')$, and second, under $\pi^*$, the game reaches a state labelled with proposition $a$ with non-zero probability starting from any state $s \in S$ such that $v^*(s) > 0$, under any Player 2 strategy. An optimal Player 1 strategy can be constructed from the optimal values in time linear in the size of the game using a technique called retrograde analysis, first introduced in the artificial intelligence community to solve chess endgames [37]. Intuitively, the strategy is computed from states labelled with $a$, using backward propagation. For details of the construction, see [34].
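The value iteration scheme can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the update rules (value 1 in target states, maximum over successors in Player 1 states, minimum in Player 2 states, expectation in probabilistic states) are the standard ones and are assumed here rather than copied from Fig. 2; the fixed threshold `alpha` is a simple stopping rule, not the rounding procedure of [36]; and the function name and the toy game are hypothetical.

```python
def value_iteration(s1, s2, sp, delta, target, alpha=1e-10, max_iter=100_000):
    """Approximate the maximal probability, for Player 1, of reaching `target`."""
    states = s1 | s2 | sp
    v = {s: 1.0 if s in target else 0.0 for s in states}      # v_0
    for _ in range(max_iter):
        new = {}
        for s in states:
            succ = [t for t, p in delta[s].items() if p > 0]
            if s in target:
                new[s] = 1.0
            elif s in s1:
                new[s] = max(v[t] for t in succ)              # Player 1 maximises
            elif s in s2:
                new[s] = min(v[t] for t in succ)              # Player 2 minimises
            else:
                new[s] = sum(delta[s][t] * v[t] for t in succ)
        diff = max(abs(new[s] - v[s]) for s in states)
        v = new
        if diff <= alpha:                                     # simple stopping rule
            break
    return v


# Hypothetical toy game, unrelated to Fig. 1: from s0 the goal is reached directly
# with probability 0.4, otherwise Player 1 (p1) or Player 2 (p2) picks a reaction.
s1, s2 = {"p1"}, {"p2"}
sp = {"s0", "c_safe", "c_risky", "goal", "fail"}
delta = {
    "s0": {"goal": 0.4, "p1": 0.3, "p2": 0.3},
    "p1": {"c_safe": 1, "c_risky": 1},
    "p2": {"c_safe": 1, "c_risky": 1},
    "c_safe": {"s0": 0.9, "fail": 0.1},     # usually returns to s0
    "c_risky": {"goal": 0.5, "fail": 0.5},  # coin toss between goal and fail
    "goal": {"goal": 1.0},
    "fail": {"fail": 1.0},
}
values = value_iteration(s1, s2, sp, delta, target={"goal"})
```

For this toy game the fixed-point equations can be solved by hand: Player 1 prefers `c_safe`, Player 2 prefers `c_risky`, and the value at `s0` satisfies $v = 0.4 + 0.27\,v + 0.15$, i.e., $v = 55/73 \approx 0.753$, which the iteration approaches from below. Note that reading off a greedy successor from the computed values does not by itself guarantee the second optimality condition discussed above.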
The minimisation queries $\mathsf{P}_{\min=?}[\mathsf{F}\,a]$ can be solved analogously. The value iteration algorithm can also be used for a fixed number of iterations to approximate the optimal values and strategy. For example, to solve the strategy synthesis problem for properties of type $\mathsf{P}_{\geq p}[\mathsf{F}\,a]$, the maximisation query is considered and value iteration is terminated once the value $v^*_n(s)$ is greater than or equal to $p$ for the chosen state $s \in S$, and the corresponding Player 1 strategy is then computed in a way similar to the optimal case above.
Besides the value iteration algorithm, one can adapt the equations from Fig. 2 to design a quadratic program and a strategy iteration algorithm that iterates over pure memoryless strategies [38]. For a chosen Player 1 strategy $\pi \in \Pi$, the strategy iteration algorithm computes the optimal values $v^*_\pi$ obtainable by Player 2 if Player 1 plays according to $\pi$, and then the algorithm locally improves $\pi$ to achieve better values for Player 1. While the best known bound for the number of iterations is exponential, the algorithm performs well in practice and no class of games is known for which an exponential number of iterations is required.
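The evaluate-and-improve loop of strategy iteration can be sketched as follows. This is a simplified illustration and not the algorithm of [38]: it assumes a stopping game, evaluates each Player 1 strategy approximately by an inner value iteration in which Player 2 minimises, and uses a small tolerance `eps` when deciding whether a switch is an improvement. All names and the toy game are hypothetical.

```python
def evaluate(pi, s1, s2, sp, delta, target, alpha=1e-12, max_iter=100_000):
    """Values obtained when Player 1 follows pi and Player 2 minimises."""
    states = s1 | s2 | sp
    v = {s: 1.0 if s in target else 0.0 for s in states}
    for _ in range(max_iter):
        new = {}
        for s in states:
            succ = [t for t, p in delta[s].items() if p > 0]
            if s in target:
                new[s] = 1.0
            elif s in s1:
                new[s] = v[pi[s]]                       # fixed choice of Player 1
            elif s in s2:
                new[s] = min(v[t] for t in succ)
            else:
                new[s] = sum(delta[s][t] * v[t] for t in succ)
        diff = max(abs(new[s] - v[s]) for s in states)
        v = new
        if diff <= alpha:
            break
    return v


def strategy_iteration(s1, s2, sp, delta, target, eps=1e-9):
    """Iterate over pure memoryless Player 1 strategies until no local improvement."""
    successors = {s: sorted(t for t, p in delta[s].items() if p > 0) for s in s1}
    pi = {s: successors[s][0] for s in s1}              # arbitrary initial strategy
    while True:
        v = evaluate(pi, s1, s2, sp, delta, target)
        improved = False
        for s in s1:
            best = max(successors[s], key=lambda t: v[t])
            if v[best] > v[pi[s]] + eps:                # strictly better successor
                pi[s] = best
                improved = True
        if not improved:
            return pi, v


# Hypothetical toy game with one Player 1 state (p1) and one Player 2 state (p2).
s1, s2 = {"p1"}, {"p2"}
sp = {"s0", "c_safe", "c_risky", "goal", "fail"}
delta = {
    "s0": {"goal": 0.4, "p1": 0.3, "p2": 0.3},
    "p1": {"c_safe": 1, "c_risky": 1},
    "p2": {"c_safe": 1, "c_risky": 1},
    "c_safe": {"s0": 0.9, "fail": 0.1},
    "c_risky": {"goal": 0.5, "fail": 0.5},
    "goal": {"goal": 1.0},
    "fail": {"fail": 1.0},
}
pi_opt, v_opt = strategy_iteration(s1, s2, sp, delta, target={"goal"})
```

On this toy game the loop starts from the choice `c_risky` (value 0.7 at `s0`), switches `p1` to `c_safe` after one evaluation, and then stops with value $55/73$ at `s0`.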
Finally, there also exists a randomised subexponential-time algorithm to solve the problem. For simple stochastic games, it was introduced in [39] and can be extended to general stochastic games as shown in [40]. Intuitively, the algorithm randomly tries to guess the optimal transition in a chosen Player 1 state of the game and verifies whether the best strategy with the chosen transition is optimal. If not, it removes the transition and proceeds.
For the special case of almost-sure reachability properties $\mathsf{P}_{\geq 1}[\mathsf{F}\,a]$, the strategy synthesis problem can be solved in quadratic time as follows. The problem is first reduced to an equivalent strategy synthesis problem for a 2-player, non-stochastic game with a reachability property as shown in [41], which is then solved using a simple graph algorithm, see, e.g., [14] and references therein.
For step-bounded numerical reachability queries $\mathsf{P}_{\min=?}[\mathsf{F}^{\leq k} a]$ and $\mathsf{P}_{\max=?}[\mathsf{F}^{\leq k} a]$, pure finite-memory strategies need to be considered. Intuitively, longer paths with higher probability of reaching a state labelled with proposition $a$ may be preferred when enough time remains until the deadline $k$, whereas shorter paths with lower probability might need to be considered when the deadline is approaching. The optimal values and a Player 1 strategy for a game $G$ can be computed by performing a fixed number $k$ of iterations of the value iteration algorithm on the game $G^{\leq k}$ that is an extension of the game $G$ to keep track of the number of steps performed. Formally, $G^{\leq k}$ has states of the form $(s, i)$, where $s \in S$ and $i \in \{0, 1, \ldots, k\}$, and the transition function follows $\Delta$ while incrementing the step counter, i.e., $\Delta^{\leq k}((s, i), (s', i+1)) = \Delta(s, s')$ for $0 \leq i < k$.

We conclude this section by commenting on a reduction from reachability to reward properties. The strategy synthesis problem for a game $G$, state $s \in S$ and probabilistic reachability property $\mathsf{P}_{\bowtie p}[\mathsf{F}\,a]$ can be reduced to strategy synthesis for a game $G'$, state $s$ and a total reward property $\mathsf{R}^r_{\bowtie p}[\mathsf{F}^{c} a]$, where the game $G'$ is $G$ with a new probabilistic terminal state $s_f$. In $G'$, the transitions of all states $s \in S$ labelled with proposition $a$ are defined as $\Delta(s, s_f) = 1$ and $\Delta(s, s') = 0$ otherwise. The reward structure is defined as $r(s) = 1$ for all $s \in S$ labelled with $a$ and $r(s) = 0$ otherwise.
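The step-bounded computation can be illustrated with a short backward-induction sketch. Instead of building $G^{\leq k}$ explicitly, the code below indexes values by the number of remaining steps, which corresponds to the counter component of the unfolded game; this simplification, the function name and the toy game are assumptions made for the example. The recorded choices show why memory, here the remaining step count, is needed.

```python
def bounded_reachability(s1, s2, sp, delta, target, k):
    """w[j][s]: maximal probability of reaching `target` from s within j steps.

    choice[j][s]: successor picked by Player 1 in s when j steps remain.
    """
    states = s1 | s2 | sp
    w = [{s: 1.0 if s in target else 0.0 for s in states}]    # 0 steps remaining
    choice = [{}]
    for _ in range(k):
        prev, cur, ch = w[-1], {}, {}
        for s in states:
            succ = sorted(t for t, p in delta[s].items() if p > 0)
            if s in target:
                cur[s] = 1.0
            elif s in s1:
                ch[s] = max(succ, key=lambda t: prev[t])      # Player 1 maximises
                cur[s] = prev[ch[s]]
            elif s in s2:
                cur[s] = min(prev[t] for t in succ)           # Player 2 minimises
            else:
                cur[s] = sum(delta[s][t] * prev[t] for t in succ)
        w.append(cur)
        choice.append(ch)
    return w, choice


# Hypothetical toy game: from p, the fast route reaches the goal in 2 steps with
# probability 0.5, the slow route in 3 steps with probability 0.9.
s1, s2 = {"p"}, {"q"}
sp = {"c_fast", "c_slow", "d", "goal", "fail"}
delta = {
    "p": {"c_fast": 1, "c_slow": 1},
    "q": {"c_fast": 1, "d": 1},
    "c_fast": {"goal": 0.5, "fail": 0.5},
    "c_slow": {"d": 1.0},
    "d": {"goal": 0.9, "fail": 0.1},
    "goal": {"goal": 1.0},
    "fail": {"fail": 1.0},
}
w, choice = bounded_reachability(s1, s2, sp, delta, target={"goal"}, k=4)
```

With two steps remaining only the fast route can still succeed, so Player 1 takes it and obtains 0.5; with three or more steps remaining the slow route is preferred and yields 0.9. No memoryless strategy realises both choices.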
Example 2: Recall the autonomous car example introduced in Ex. 1, modelled with the stochastic game $G$ shown in Fig. 1(a) with starting state $s_0 \in S_p$. Consider the numerical query $P_{\max=?}[\mathsf{F}\, succ]$ to determine the maximal probability of reaching a state labelled with proposition $succ$, indicating that the car successfully finished the route. The maximal probability is approximately 0.87 and the corresponding pure memoryless optimal strategy is to brake for both hazards. Next, consider the step-bounded numerical query $P_{\max=?}[\mathsf{F}^{\leq k} succ]$. We computed results for step bounds $1 \leq k \leq 30$. The optimal values can be observed in Fig. 3. For $k \leq 3$, the only way for the car to successfully finish the route in at most 3 steps starting from $s_0$ is not to encounter any hazards, and thus the maximum probability is 0.2. For $k = 4$ and $k = 5$, the maximum probability is approximately 0.32 and 0.35, respectively, and there exists a pure memoryless optimal strategy that always brakes in a traffic jam and honks when approaching a pedestrian. Finally, for $k \geq 6$, the maximum probability increases gradually with $k$ and, here, optimal strategies are pure, but require finite memory. To be specific, an optimal strategy is to always brake in a traffic jam, and when approaching a pedestrian react as follows. If at most 4 or at least 6 steps remain until the bound $k$, then honk. If exactly 5 steps remain to successfully finish the route, then brake.

C. Probabilistic LTL
The standard approach to solving numerical queries of the form $P_{\max=?}[\psi]$ with an arbitrary LTL formula $\psi$ is to translate the formula $\psi$ into a deterministic Rabin automaton [42] of size up to doubly exponential in the size of the formula. Since LTL formulas considered in control are typically small, the size of the corresponding automaton is manageable. The synchronous product of the game $G$ and the automaton is a Rabin stochastic game. Such games can be solved by combining the value iteration algorithm for reachability queries with any algorithm for Rabin non-stochastic games as presented in [43]. Alternatively, the deterministic Rabin automaton can be translated to a different type of automaton called a parity automaton [44] and the product parity stochastic game can be solved using the strategy iteration or randomised subexponential algorithm presented in [45].
As for the reachability properties, the problem of solving a stochastic game with respect to an almost-sure LTL property $P_{\geq 1}[\psi]$ can be solved by reducing the corresponding almost-sure Rabin stochastic game to an equivalent Rabin 2-player, non-stochastic game [41]. An overview of existing algorithms, exponential and deterministic subexponential, for non-stochastic Rabin games can be found in [14].
Example 3: Consider the safety numerical query $P_{\max=?}[\mathsf{G}\, \neg acc]$ for the autonomous car example from Ex. 1. The query aims to compute the maximum probability with which an accident can be avoided. The maximum probability is approximately 0.92, which is higher than the maximum probability of successfully finishing the route computed in Ex. 2. The reason is that violation of road rules is not considered an accident. There exists a pure memoryless optimal strategy, namely, to change lane in a traffic jam and to brake when approaching a pedestrian.

D. Total Reward Properties
To solve the numerical query $R^{r}_{\max=?}[\rho]$, techniques similar to those for probabilistic reachability as described in Sec. III-B can be applied.
First, let $\rho = C$ and assume that the considered game is stopping, all rewards are non-negative, $r : S \to \mathbb{R}_{\geq 0}$, and terminal states have reward 0. This means that the expected total reward is always finite. The optimal values for Player 1 states $s \in S_1$ are defined as $v^*(s) = \lim_{n \to \infty} v^*_n(s)$, where $v^*_n(s)$ is iteratively computed as indicated in Fig. 4. As for the value iteration in Sec. III-B, the iteration is not guaranteed to converge in finite time, but using a precision threshold the limit values can be computed in a number of iterations that is at most exponential in the size of the game [36]. Unlike in the reachability case, since the game is assumed to be stopping, there is only one necessary and sufficient condition for a (pure memoryless) optimal Player 1 strategy $\pi^*$. Namely, for every state $s \in S_1$, it must hold that $\pi^*(s) \in \arg\max_{s' : \Delta(s, s') = 1} v^*(s')$. An optimal strategy can thus be constructed from the optimal values in linear time using the above condition. Maximisation queries with non-positive rewards $r : S \to \mathbb{R}_{\leq 0}$, as well as minimisation queries, can be solved in an analogous way, and the value iteration can be used for a fixed number of iterations to approximate the optimal value and strategy to solve properties of the type $\phi = R^{r}_{\bowtie x}[C]$. For non-stopping games, the set of states that receive infinite total reward can be computed by solving the game with respect to a parity condition [2]. After removing these states, the value iteration algorithm can be applied to compute the (bounded) optimal values for the remaining states.
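A minimal sketch of such a value iteration for a stopping game with non-negative rewards is given below. It assumes that the iteration starts from zero and adds the reward of the current state to the optimised (or averaged) value of its successors; the exact case distinction of Fig. 4 is not reproduced, and the encoding, function name and toy game are hypothetical. Stopping once successive iterates differ by less than `eps` is a common heuristic and not, by itself, a guarantee on the distance to the limit.

```python
def total_reward_vi(owner, delta, reward, eps=1e-9, max_iter=100_000):
    """Approximate maximal expected total reward and a Player 1 strategy.

    owner[s] is 1, 2 or "p"; delta[s] maps successors to probabilities;
    reward[s] >= 0, and terminal states carry reward 0 with a self-loop.
    """
    v = {s: 0.0 for s in owner}
    for _ in range(max_iter):
        new = {}
        for s in owner:
            if owner[s] == "p":
                nxt = sum(p * v[t] for t, p in delta[s].items())
            elif owner[s] == 1:
                nxt = max(v[t] for t in delta[s])
            else:
                nxt = min(v[t] for t in delta[s])
            new[s] = reward[s] + nxt
        diff = max(abs(new[s] - v[s]) for s in owner)
        v = new
        if diff < eps:
            break
    # Pure memoryless strategy: pick a successor with maximal value.
    strategy = {s: max(delta[s], key=lambda t: v[t])
                for s in owner if owner[s] == 1}
    return v, strategy


owner = {"s0": 1, "a": "p", "b": "p", "end": "p"}
delta = {
    "s0": {"a": 1.0, "b": 1.0},
    "a": {"s0": 0.8, "end": 0.2},   # loop back with probability 0.8
    "b": {"end": 1.0},
    "end": {"end": 1.0},
}
reward = {"s0": 1.0, "a": 0.0, "b": 2.0, "end": 0.0}
values, strategy = total_reward_vi(owner, delta, reward)
```

In the toy game, looping through `a` yields the fixed point $v = 1 + 0.8\,v$, i.e. $v = 5$, which exceeds the value 3 obtained by moving to `b`.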
Next, let $\rho = \mathsf{F}^{\star} a$. For $\star = c$, the query can be reduced to the case above, with $\rho = C$, by adding a new terminal state $s_f$ with $r(s_f) = 0$ and altering the transitions of states $s$ such that $a \in L(s)$ by letting $\Delta(s, s_f) = 1$ and $\Delta(s, s') = 0$ otherwise. For $\star = \infty$, we proceed in a similar fashion. However, while value iteration in Fig. 4 computes the least fixed point, in this case we need to compute the greatest fixed point, as zero-reward paths that do not reach a state labelled with proposition $a$ need to be identified. This can be done using the computation in Fig. 4 for an altered game, where all zero rewards are changed to an arbitrary $\varepsilon > 0$ [35]. Finally, for $\star = 0$, the optimal strategy may depend on the rewards accumulated so far and pure finite-memory strategies suffice for Player 1 to win. The computation combines the value iteration algorithms from Fig. 2 and Fig. 4, see [35] for details.
Step-bounded numerical queries with $\rho = C^{\leq k}$ can be solved using a fixed number $k$ of iterations of the value iteration algorithm on the game $G^{\leq k}$ defined in Sec. III-B, and optimal strategies might thus require memory.
Example 4: For the autonomous car in Ex. 1, we compute the minimum expected total energy and time demands before successful route completion, $R^{r_{energy}}_{\min=?}[\mathsf{F}^{c} succ]$ and $R^{r_{time}}_{\min=?}[\mathsf{F}^{c} succ]$. For energy, the minimum value is approximately 3.33 with the pure memoryless optimal strategy to honk for both hazards. For time, the minimum value is approximately 2.71 with the pure memoryless optimal strategy to change lane for both hazards. Note that changing lane in a traffic jam might violate traffic rules, resulting in entering the grey terminal state in Fig. 1(a). While in such a case the route can no longer be successfully completed, the time cost drops to 0 for all the following steps, and thus changing lane, while potentially violating road rules, results in lower expected total time. In comparison, if the time cost assigned to the terminal state corresponding to traffic rules violation was 1, the optimal strategy would be to brake in a traffic jam and change lane when approaching a pedestrian, with the expected total time approximately 3.46.

E. Discounted Reward Properties
To solve numerical queries of the form $R^{r}_{\max=?}[D^{\beta}]$, we present their reduction to probabilistic reachability queries [34], as well as to total reward queries [46].
Let $G$ be a stochastic game with a reward structure $r$. First, we describe the reduction to probabilistic reachability. Without loss of generality, assume that all rewards take values in the interval $[0, 1]$. Construct a game $G' = (S', (S'_1, S'_2, S'_p), \Delta')$ defined as follows. First, add two terminal probabilistic states $s_0, s_1 \in S'_p$. Next, for every Player 1 or Player 2 state $s \in S_1 \cup S_2$ and every $s' \in S$ such that $\Delta(s, s') = 1$, we add a new probabilistic state $t_{s,s'} \in S'_p$ and define $\Delta'(s, t_{s,s'}) = 1$, $\Delta'(s, s') = 0$. For probabilistic states $t_{s,s'}$, we let $\Delta'(t_{s,s'}, s') = \beta$, $\Delta'(t_{s,s'}, s_1) = (1 - \beta)\, r(s)$ and $\Delta'(t_{s,s'}, s_0) = (1 - \beta)(1 - r(s))$. Finally, for probabilistic states $s \in S_p$ we define $\Delta'(s, s') = \beta\, \Delta(s, s')$ for $s' \in S$, $\Delta'(s, s_1) = (1 - \beta)\, r(s)$ and $\Delta'(s, s_0) = (1 - \beta)(1 - r(s))$. Consider $AP = \{a\}$ and the labelling function defined as $L(s_1) = \{a\}$ and $L(s) = \emptyset$ otherwise. It holds that every optimal strategy for $G'$ with respect to the numerical query $P_{\max=?}[\mathsf{F}\, a]$ is also an optimal strategy for $G$ with respect to the numerical query $R^{r}_{\max=?}[D^{\beta}]$.

Next, we present a reduction to total reward queries that builds on the same principles. Construct a game $G' = (S', (S'_1, S'_2, S'_p), \Delta')$ and a reward structure $r'$ defined as follows. First, for every $s \in S_1 \cup S_2$ add a new probabilistic state $t_s \in S'_p$ and add a new terminal probabilistic state $s_f \in S'_p$. For all states $s, s' \in S$ such that $\Delta(s, s') > 0$ define $\Delta'(s, t_{s'}) = \Delta(s, s')$, $\Delta'(s, s') = 0$. For probabilistic states $t_s$, we let $\Delta'(t_s, s) = \beta$ and $\Delta'(t_s, s_f) = 1 - \beta$. The reward structure $r'$ is such that $r'(s) = r(s)$ for $s \in S$ and the reward is 0 otherwise. It holds that every optimal strategy for $G'$ with respect to the numerical query $R^{r'}_{\max=?}[C]$ is also an optimal strategy for $G$ with respect to the numerical query $R^{r}_{\max=?}[D^{\beta}]$.
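The idea behind the reduction to total reward, namely that discounting by $\beta$ can be read as stopping with probability $1 - \beta$ after every step, can be sketched as follows. The sketch makes several assumptions that should be checked against [46]: it inserts an intermediate state `("t", s)` in front of every state `s`, it uses a hypothetical dictionary encoding, and it evaluates the transformed game with a plain fixed-length value iteration.

```python
def discount_to_total(owner, delta, reward, beta):
    """Build a game whose total reward equals the beta-discounted reward.

    Assumption: every transition into s is redirected through ("t", s),
    which continues to s with probability beta and stops otherwise.
    The name "s_f" must not already be used by a state of the game.
    """
    sink = "s_f"
    new_owner = dict(owner)
    new_reward = dict(reward)
    new_delta = {sink: {sink: 1.0}}
    new_owner[sink] = "p"
    new_reward[sink] = 0.0
    for s in owner:
        t = ("t", s)
        new_owner[t] = "p"
        new_reward[t] = 0.0
        new_delta[t] = {s: beta, sink: 1.0 - beta}
        new_delta[s] = {("t", nxt): p for nxt, p in delta[s].items()}
    return new_owner, new_delta, new_reward


def total_reward(owner, delta, reward, iterations=200):
    """Plain value iteration for the maximal expected total reward."""
    v = {s: 0.0 for s in owner}
    for _ in range(iterations):
        new = {}
        for s in owner:
            if owner[s] == "p":
                nxt = sum(p * v[t] for t, p in delta[s].items())
            else:
                pick = max if owner[s] == 1 else min
                nxt = pick(v[t] for t in delta[s])
            new[s] = reward[s] + nxt
        v = new
    return v


owner = {"s": 1, "hi": "p", "lo": "p"}
delta = {"s": {"hi": 1.0, "lo": 1.0}, "hi": {"hi": 1.0}, "lo": {"lo": 1.0}}
reward = {"s": 0.0, "hi": 1.0, "lo": 0.5}
values = total_reward(*discount_to_total(owner, delta, reward, beta=0.5))
```

For $\beta = 0.5$ the discounted reward of staying in `hi` forever is $1/(1 - \beta) = 2$, and from `s` it is $\beta \cdot 2 = 1$; the total reward computed on the transformed game agrees with both values.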

F. Average Reward Properties
Unlike other infinite horizon properties such as total reward, expected average reward disregards all transient behaviour. Nevertheless, it was proven in [22], [23] that pure memoryless strategies still suffice for both players to win. In [23], the authors show that, for a game $G$ with an average reward numerical query $R^{r}_{\max=?}[S]$, there exists a discount factor $\beta$ such that any strategy optimal in $G$ with respect to the discounted reward numerical query $R^{r}_{\max=?}[D^{\beta}]$ is also optimal with respect to the average reward query. Moreover, for a strategy to be optimal for the latter, it suffices if it is optimal for the former for every $\beta$ sufficiently close to 1. The authors in [34] then compute a concrete value of the discount factor and prove that a solution to numerical queries of the form $R^{r}_{\max=?}[S]$ can be found as a solution to the numerical query $R^{r}_{\max=?}[D^{\beta}]$ for any $\beta \in [\beta^*, 1)$, where $\beta^*$ is the concrete threshold derived in [34].

Alternatively, it has been shown in [47] that, unlike for the general case, for stochastic games that are ergodic, i.e., the optimal average reward is independent of the initial state of the game, optimal strategies can be constructed using locally optimal moves. The algorithm reduces the game, using a potential transformation, to a canonical form in which the locally optimal moves are also globally optimal. The algorithm is pseudo-polynomial if the game has a constant number of probabilistic states, and otherwise it can be up to exponential in the number of probabilistic states.
Finally, for a related property called an almost-sure average reward property, where the aim is to achieve a certain average reward with probability 1 (as opposed to the expected average), one can use the approach presented in [1]. As the algorithm was primarily designed to solve games with respect to a conjunction of such properties, we discuss the property and the approach in more detail in Sec. IV-G.

G. Branching-time Properties
The definition of properties in Def. 6 allows one to reason about probabilistic and expected reward properties of games over linear time. Below, we extend the definition to branching time by combining properties in a PCTL*-like fashion. The resulting logic has been introduced and studied in [35], [2].
Note that, unlike in Def. 6, here we only allow $\rho = \mathsf{F}^{\star} a$ for the reward operator $R^{r}_{\bowtie x}[\rho]$. The semantics for branching-time properties is as shown in Tab. I, with formulas $\phi$ being interpreted over states $s \in S$ of the game $G$, and formulas $\psi$ being interpreted over paths $\lambda \in \mathit{Path}_G$ of the game. For completeness, the definitions not shown in Tab. I are given as follows:

The problem of verification for branching-time properties is formulated in a way analogous to Problem 1. Given a game $G$, its state $s$ and a branching-time property $\phi$, the verification problem can be solved similarly to PCTL* model checking for MDPs [31]. Intuitively, the solution is achieved by traversing the parse tree of $\phi$ in a bottom-up fashion. Iteratively, the innermost subformulas, which are either probabilistic LTL properties or total reward properties, are solved using techniques discussed in Sec. III-B to III-D, and replaced by new atomic propositions such that a state is labelled with the new proposition if and only if the answer to Problem 1 for the corresponding property is 'yes'. For a full description, see [2].
On the other hand, the formulation of the strategy synthesis problem in Problem 2 does not extend to branching-time properties in a straightforward way. For example, for a formula $\phi = P_{\bowtie p_1}[\psi_1] \wedge P_{\bowtie p_2}[\psi_2]$, the semantics implies that $s \models \phi$ if there exists a Player 1 strategy $\pi_1 \in \Pi$ such that, for all Player 2 strategies $\sigma \in \Sigma$, it holds that $\Pr^{\pi_1,\sigma}_{G,s}(\psi_1) \bowtie p_1$, and, at the same time, there exists a Player 1 strategy $\pi_2 \in \Pi$ (possibly different from $\pi_1$) such that, for all Player 2 strategies $\sigma \in \Sigma$, it holds that $\Pr^{\pi_2,\sigma}_{G,s}(\psi_2) \bowtie p_2$. This means that, even if the satisfaction $s \models \phi$ holds, there may not exist a single Player 1 strategy that is a witness to it. While the problem of strategy synthesis cannot be formulated in this way for the full logic from Def. 8, there exist branching-time properties for which the problem can be formulated and is indeed interesting. For example, consider a formula $\phi = P_{\bowtie p}[\mathsf{F}\, R_{\bowtie x}[\mathsf{F}^{c} a]]$, which states that, starting from a state $s \in S$, there exists a Player 1 strategy $\pi_1$ such that, with probability that satisfies the bound $p$ and under any Player 2 strategy, the game reaches a state $s' \in S$ from which there exists a (possibly different) Player 1 strategy $\pi_2$ that guarantees that the expected total reward before reaching a state labelled with proposition $a$ satisfies the bound $x$. Note that, if indeed $s \models \phi$, a witness to this satisfaction is the strategy $\pi_1$ and the strategy $\pi_2$ does not need to be constructed. In [2], the author discusses the strategy synthesis problem for a fragment of branching-time properties in Def. 8. To be specific, an algorithm is described to synthesise a winning Player 1 strategy for branching-time properties of the form $P_{\bowtie p}[\psi]$ and $R_{\bowtie x}[\mathsf{F}^{\star} \phi]$. Similarly to the verification case above, the algorithm is based on traversing the parse tree of the formula and constructing strategies for probabilistic LTL and total reward properties using techniques from Sec. III-B to III-D.
Remark 1: The semantics of branching-time properties can also be defined in a different way as follows. First, let the properties in Def. 6 be interpreted over Markov chains rather than over states of a game. More formally, given a game $G$, its state $s \in S$, a Player 1 strategy $\pi \in \Pi$, and a Player 2 strategy $\sigma \in \Sigma$, we let $G, s, \pi, \sigma \models P_{\bowtie p}[\psi]$ if and only if $\Pr^{\pi,\sigma}_{G,s}(\psi) \bowtie p$, and the semantics for the reward operator $R^{r}_{\bowtie x}[\rho]$ is defined in an analogous way. Given a state $s \in S$ and a property $\phi = P_{\bowtie p}[\psi]$ or $\phi = R^{r}_{\bowtie x}[\rho]$, the verification problem then asks for the existence of a Player 1 strategy $\pi$ such that, for all Player 2 strategies $\sigma$, it holds that $G, s, \pi, \sigma \models \phi$, and, likewise, the strategy synthesis problem aims to construct such a Player 1 strategy. This semantics of properties can be straightforwardly extended to branching-time properties in Def. 8. For example, for a formula $\phi = \phi_1 \wedge \phi_2$, it holds that $G, s, \pi, \sigma \models \phi$ if and only if $G, s, \pi, \sigma \models \phi_1$ and $G, s, \pi, \sigma \models \phi_2$. In particular, note the difference between this semantics of a conjunction and the semantics given earlier in this section, after Def. 8. Both formulations of the verification and strategy synthesis problems can now be directly extended from simple linear-time and reward properties to branching-time properties. The problems are, however, very intricate. It has been shown in [48] that, already for PCTL, a fragment of the above branching-time properties with the probabilistic operator, the games are generally not determined. That means that there might exist states of the games from which neither of the two players has a winning strategy. Moreover, winning strategies may require (possibly infinite) memory and/or randomisation. Therefore, it makes sense to formulate the verification and strategy synthesis problems for specific subclasses of strategies, e.g., pure finite-memory or randomised infinite-memory strategies. In [48], the authors prove several complexity results for the verification problem with restricted classes of properties and strategies, including an undecidability result for a simple fragment of PCTL and finite-memory strategies.

H. Stochastic Games with Multiple Players
The definition of a stochastic game in Def. 1 can be extended to a multi-player stochastic game as follows.
Definition 9 (Multi-player stochastic game): A multi-player stochastic game is a tuple $G = (S, (S_1, \ldots, S_n, S_p), \Delta)$, where $S$ is a finite set of states partitioned into a set of probabilistic states $S_p$ and sets $S_1, \ldots, S_n$ of states of Players 1 to $n$, respectively. The probabilistic transition function $\Delta : S \times S \to [0, 1]$ is such that for all states $s \in \bigcup_{1 \leq i \leq n} S_i$ it holds that $\Delta((s, s')) \in \{0, 1\}$ for every $s' \in S$, and for probabilistic states $s \in S_p$ we have $\sum_{s' \in S} \Delta((s, s')) = 1$.
A strategy for Player $i \in \{1, \ldots, n\}$ is defined as in Def. 4. A strategy for a coalition of players $C \subseteq \{1, \ldots, n\}$ consists of a set of strategies for the players in the coalition, one for each player. Definitions of linear- and branching-time properties in Def. 6 and Def. 8 can be extended to consider coalitions of players $C \subseteq \{1, \ldots, n\}$ using the syntax $\langle\langle C \rangle\rangle \phi$. For branching-time properties, the resulting logic is called rPATL*; for a detailed definition of the semantics, see [35], [2]. Intuitively, multi-player verification and strategy synthesis problems ask whether and how players of the coalition $C$ can cooperatively guarantee satisfaction of the property $\phi$. These problems reduce to the corresponding Problems 1 and 2 for the stochastic game with two players, where Player 1 represents the collective behaviour of the coalition $C$ and Player 2 represents the remaining players $\{1, \ldots, n\} \setminus C$.
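The reduction to two players only changes who owns each state, which the following minimal sketch illustrates. The encoding of ownership as a dictionary (player index or `"p"` for probabilistic states) and the function name are assumptions made for illustration.

```python
def coalition_game(owner, coalition):
    """Collapse a multi-player ownership map into a two-player one.

    States of coalition members become Player 1 states, states of all
    other players become Player 2 states, and probabilistic states
    (marked "p") are left untouched. Transitions are unchanged.
    """
    two_player = {}
    for state, who in owner.items():
        if who == "p":
            two_player[state] = "p"
        elif who in coalition:
            two_player[state] = 1
        else:
            two_player[state] = 2
    return two_player


owner = {"u": 1, "v": 2, "w": 3, "x": "p"}
collapsed = coalition_game(owner, coalition={1, 3})
```

Here Players 1 and 3 form the coalition, so their states `u` and `w` are controlled by Player 1 of the resulting game, while `v` goes to Player 2.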

IV. MULTI-OBJECTIVE GAME SOLVING
In this section, we discuss the problem of strategy synthesis, where the goal is to simultaneously satisfy a certain combination of properties of the form in Def. 6.

A. Problem Formulation
Definition 10 (Multi-objective property): A multi-objective property $\Phi$ is a conjunction of properties of the form $P_{\geq p}[\psi]$ and $R^{r}_{\geq x}[\rho]$. The semantics of a multi-objective property involving $n$ probabilistic properties and $m$ reward properties, i.e., $\Phi = \bigwedge_{i=1}^{n} P_{\geq p_i}[\psi_i] \wedge \bigwedge_{j=1}^{m} R^{r_j}_{\geq x_j}[\rho_j]$, is defined over states $s \in S$ of a game $G$ as follows. It holds that $s \models \Phi$ if and only if there exists a Player 1 strategy $\pi \in \Pi$ such that, for all Player 2 strategies $\sigma \in \Sigma$ and $1 \leq i \leq n$, it holds that $\Pr^{\pi,\sigma}_{G,s}(\psi_i) \geq p_i$, and similarly, for all Player 2 strategies $\sigma \in \Sigma$ and $1 \leq j \leq m$, it holds that $\mathbb{E}^{\pi,\sigma}_{G,s}(\mathit{rew}(r_j, \rho_j)) \geq x_j$. Note that, while such a conjunction is syntactically a branching-time property according to Def. 8, its semantics is different. In fact, the semantics of a multi-objective property is in line with the alternative semantics of branching-time properties discussed in Rem. 1.
Every property of the form $R^{r}_{\leq x}[\rho]$ is equivalent to the property $R^{-r}_{\geq -x}[\rho]$ and, similarly, every property of the form $P_{\leq p}[\psi]$ is equivalent to $P_{\geq 1-p}[\neg\psi]$. Thus, the above definition of a multi-objective property covers conjunctions of all properties from Def. 6. While in Def. 10 we define multi-objective properties only as conjunctions (unlike, e.g., in [49], where any positive Boolean combinations are allowed), we address more complex combinations for some classes of properties further in this section.
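Both equivalences follow from a one-line calculation for any fixed pair of strategies $\pi, \sigma$. The derivation below is a sketch that assumes $\mathit{rew}(-r, \rho)(\lambda) = -\mathit{rew}(r, \rho)(\lambda)$ on every path $\lambda$, which holds, for instance, when the reward of a path is a convergent sum of state rewards.

```latex
\begin{align*}
\Pr\nolimits^{\pi,\sigma}_{G,s}(\psi) \le p
  &\iff 1 - \Pr\nolimits^{\pi,\sigma}_{G,s}(\neg\psi) \le p
   \iff \Pr\nolimits^{\pi,\sigma}_{G,s}(\neg\psi) \ge 1 - p, \\
\mathbb{E}^{\pi,\sigma}_{G,s}\bigl(\mathit{rew}(r,\rho)\bigr) \le x
  &\iff -\,\mathbb{E}^{\pi,\sigma}_{G,s}\bigl(\mathit{rew}(r,\rho)\bigr) \ge -x
   \iff \mathbb{E}^{\pi,\sigma}_{G,s}\bigl(\mathit{rew}(-r,\rho)\bigr) \ge -x .
\end{align*}
```

The first line uses that $\psi$ and $\neg\psi$ partition the set of paths, the second uses linearity of expectation.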
Let $\Phi$ be a multi-objective property involving $n$ probabilistic properties and $m$ reward properties. For simplicity, we use $r = (r_1, \ldots, r_m)$ to denote the vector of reward structures and $r(s) = (r_1(s), \ldots, r_m(s))$, for every $s \in S$. Similarly, $p = (p_1, \ldots, p_n)$ and $x = (x_1, \ldots, x_m)$ denote the vectors of probability and reward bounds. Instead of $\Phi$, we sometimes write $\Phi(p, x)$ to emphasise the corresponding bounds. We say that $\Phi(p, x)$, or the vector of bounds $(p, x)$ for $\Phi$, is achievable if and only if there exists a winning strategy for Player 1 that guarantees all properties in $\Phi$ with bounds $p$, $x$. The optimal achievable vectors of bounds are called Pareto vectors.
Definition 11 (Pareto set): Let $\Phi$ be a multi-objective property involving $n$ probabilistic and $m$ reward properties. A vector $(p, x) \in \mathbb{R}^{n+m}$ is called a Pareto vector if the property $\Phi(p - \varepsilon, x - \varepsilon)$ is achievable for every $\varepsilon > 0$ and $\Phi(p + \varepsilon, x + \varepsilon)$ is not achievable for any $\varepsilon > 0$. The Pareto set $P$ is the set of all Pareto vectors for $\Phi$.
The problems of multi-objective verification and strategy synthesis are formulated analogously to the single-objective case stated in Problems 1 and 2. Unlike in the single-objective case, optimal strategies might not exist. This is already true, for example, for precise-value games, where the objective is to achieve a precise value (related to probability or reward). Such a property can be expressed as a conjunction of two single-objective properties, and it has been shown in [46] that, in these games, a winning strategy may not exist for either of the two players.
In this section, we discuss existing solutions to multi-objective strategy synthesis depending on what type of properties are being combined. The solutions compute ε-approximations of Pareto sets and the corresponding ε-optimal strategies.

Definition 12 (Pareto set approximation): For $\varepsilon > 0$, an ε-approximation of the Pareto set is a set of vectors $Q$ such that for every $(q, y) \in Q$ there exists a Pareto vector $(p, x) \in P$ with $\|(q, y) - (p, x)\| \leq \varepsilon$, and vice versa, for every Pareto vector $(p, x) \in P$ there exists a vector $(q, y) \in Q$ with $\|(q, y) - (p, x)\| \leq \varepsilon$, where $\|\cdot\|$ is the Manhattan distance, defined as the sum of componentwise absolute differences.
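The two-sided condition of Def. 12 can be checked directly when both sets are given as finite lists of points. The sketch below makes that simplifying assumption (a Pareto set is in general infinite, so in practice one would compare representative points or vertices); the function names and sample data are hypothetical.

```python
def manhattan(u, v):
    """Sum of componentwise absolute differences of two vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def is_eps_approximation(approx, pareto, eps):
    """Check both directions of the approximation condition on finite sets."""
    # Every approximating vector is close to some Pareto vector ...
    sound = all(any(manhattan(q, p) <= eps for p in pareto) for q in approx)
    # ... and every Pareto vector is close to some approximating vector.
    complete = all(any(manhattan(q, p) <= eps for q in approx) for p in pareto)
    return sound and complete


pareto = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
good = [(0.0, 0.95), (0.5, 0.45), (0.95, 0.0)]
sparse = [(0.0, 0.95), (0.95, 0.0)]
```

With `eps = 0.1` the list `good` passes the check, while `sparse` fails because no point lies within distance 0.1 of the Pareto vector `(0.5, 0.5)`.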
[Fig. 5: Iterative computation of an ε-approximation of the Pareto set for a stopping game $G$ with multiple reward structures $r$ and a multi-objective total reward property $\Phi(x)$, carried out for every state $s \in S$. Here, $\leq$ is the componentwise partial order on $\mathbb{R}^m_{\geq 0}$, and the number of iterations depends on $|S|$ and on $\Delta_{\min}$, the smallest positive probability in $G$.]

B. Multi-Objective Total Reward Properties
Here we discuss the strategy synthesis problem for multi-objective properties that only involve total reward properties of the form $R^{r}_{\geq x}[C]$. The problem has been recently investigated in [49], [5]. As discussed in Sec. IV-A, there exist games in which neither Player 1 nor Player 2 has a winning strategy. Moreover, for stopping games with precise-value objectives, randomised exponential-memory strategies may be needed for Player 1 to win [46], and, for stopping games with general total reward objectives, randomised infinite-memory strategies may be required [49]. The problem of whether there exists a pure winning strategy for Player 1, in stopping games, is undecidable [49].
The ε-approximation of the Pareto set for a stopping game $G$ and a multi-objective property $\Phi(x)$ can be computed using the iteration algorithm in Fig. 5. Intuitively, the set $V^*_n(s)$ for a state $s \in S$ computed in the $n$-th iteration of the algorithm is the downward closure of vectors of bounds achievable by Player 1, from $s$, in the finite time horizon of up to $n$ steps. As Player 1 can randomise between successors of his/her states, the set $V^*_n(s)$ for $s \in S_1$ is computed as a downward, convex closure of the union of $V^*_{n-1}(s')$, for all $s'$ such that $\Delta(s, s') = 1$. For $s \in S_2$, the bounds must be achievable for all successor states and, hence, we take the intersection. Finally, for probabilistic states $s \in S_p$, we consider the sum weighted by the corresponding probability distribution.
Given an ε-approximation of the Pareto set, the corresponding ε-optimal Player 1 strategy can be constructed as described in [5]. To succinctly represent such strategies using a finite set of memory elements, the authors extend the definition of a strategy from Def. 4 to allow stochastic memory update, i.e., the memory update function is of type $\pi_u : M \times S \to D(M)$ and the initial memory element function is $\pi_{init} : S \to D(M)$. In the construction, the vertices of approximation sets $V^*_n(s)$, $s \in S$, act as memory elements and represent the vector of reward bounds that the strategy currently aims to achieve. The distributions in functions $\pi_u$ and $\pi_{init}$ are constructed so that the expected value of the next memory element is an ε-approximation of the target reward bounds $x$. Compared to deterministic update strategies from Def. 4, with stochastic memory update strategies the memory required to win reduces from up to exponential to linear for stopping games with precise-value objectives [46] and from up to infinite to finite for stopping games with general total reward objectives [49].
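The shape of a stochastic memory update strategy can be sketched as a small data structure. Everything below is illustrative: the class name, the dictionary encoding of distributions, the next-move function `move` (assumed here to map a state and a memory element to a distribution over successors), and the convention that memory is updated when a new state is observed are assumptions, not the definitions of [5].

```python
import random


class StochasticUpdateStrategy:
    """Finite-memory strategy with randomised initial memory and updates."""

    def __init__(self, init, update, move, seed=0):
        self.init = init      # state -> {memory element: probability}
        self.update = update  # (memory, state) -> {memory element: probability}
        self.move = move      # (state, memory) -> {successor: probability}
        self.rng = random.Random(seed)
        self.memory = None

    def _sample(self, dist):
        outcomes = list(dist.keys())
        weights = [dist[o] for o in outcomes]
        return self.rng.choices(outcomes, weights=weights)[0]

    def start(self, state):
        """Draw the initial memory element for the starting state."""
        self.memory = self._sample(self.init[state])

    def observe(self, state):
        """Update memory stochastically after the game moves to `state`."""
        self.memory = self._sample(self.update[(self.memory, state)])

    def choose(self, state):
        """Pick Player 1's next move in `state` given the current memory."""
        return self._sample(self.move[(state, self.memory)])


strategy = StochasticUpdateStrategy(
    init={"s0": {"m0": 1.0}},
    update={("m0", "s1"): {"m1": 1.0}},
    move={("s0", "m0"): {"s1": 1.0}, ("s1", "m1"): {"s0": 1.0}},
)
strategy.start("s0")
first_move = strategy.choose("s0")
strategy.observe("s1")
```

In the construction described above, the memory elements would be vertices of the sets $V^*_n(s)$; the example uses abstract labels `m0` and `m1` with point distributions so that its behaviour is deterministic.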
Besides conjunctions of total reward properties, the authors in [49] discuss multi-objective properties constructed as a disjunction of total reward properties. It is shown that there exists a strategy achieving the disjunction if and only if there exists a strategy achieving a certain single-objective total reward property, and thus pure memoryless strategies suffice to achieve a disjunction of total reward properties. Moreover, an algorithm for computing an ε-approximation of the Pareto sets for stopping games is presented. By combining the two algorithms for conjunctions and disjunctions, we obtain a solution for any positive Boolean combination of total reward properties for stopping stochastic games through first rewriting the combination into conjunctive normal form.
Example 5: For the autonomous car from Ex. 1, we consider the conjunction of total reward properties $R^{r_{time}}_{\leq x_1}[C] \wedge R^{r_{energy}}_{\leq x_2}[C]$ that aims to compute a strategy that, simultaneously, guarantees that the expected total time is at most $x_1$, and the expected total energy is at most $x_2$. An ε-approximation of the Pareto set computed using the algorithm in Fig. 5 for $\varepsilon = 0.1$ is shown in Fig. 6. For example, for $x_1 = 3.4$, $x_2 = 5.7$, the winning strategy generated by PRISM-games for the property is a stochastic memory update strategy with 155 memory elements that, in a non-trivial way, probabilistically switches between changing lane and honking for both hazards. The strategy can be viewed at [30].

C. Multi-Objective Probabilistic Reachability Properties
Using the reduction of probabilistic reachability properties to total reward properties described in Sec. III-B, the iterative algorithm in Fig. 5 can be adapted to compute ε-approximations of Pareto sets for any stopping stochastic game with a multi-objective property that involves only probabilistic reachability properties. Disjunctions of probabilistic reachability properties can be addressed in a similar manner, using the reduction from Sec. III-B with the algorithm for disjunctions of total reward properties from [49]. Hence, stopping games with any positive Boolean combination of probabilistic reachability properties can be handled.

D. Multi-Objective Probabilistic LTL Properties
For stopping stochastic games, the strategy synthesis problem for multi-objective properties involving only probabilistic LTL properties $P_{\geq p_i}[\psi_i]$, $1 \leq i \leq n$, has been discussed in [5]. In this case, the solution is to construct a deterministic Rabin automaton for each $\psi_i$ and then build a synchronous product of all the automata and the original game $G$, with a new terminal state which is entered after $G$ enters any of its terminal states. Since $G$ is stopping, it indeed suffices to analyse satisfaction of formulas upon reaching a terminal state. The problem then reduces to solving the product game with respect to a multi-objective reachability property. Finally, since the resulting product game is again stopping, we can apply the approach from Sec. IV-C. It follows that, in fact, we can solve any positive Boolean combination of probabilistic LTL properties for stopping stochastic games.
For general stochastic games, the strategy synthesis problem for multi-objective probabilistic LTL properties remains open.
Example 6: Consider the conjunction $P_{\geq p_1}[\mathsf{F}\, succ] \wedge P_{\geq p_2}[\mathsf{G}\, \neg acc]$ of the reachability and safety LTL properties for the autonomous car example. An ε-approximation of the Pareto set computed using the algorithm in Fig. 5 for $\varepsilon = 0.001$ is shown in Fig. 7. For example, for $p_1 = 0.7$, $p_2 = 0.2$, there exists a pure memoryless winning strategy (in PRISM-games generated as a stochastic memory update strategy with 25 memory elements) that brakes in a traffic jam and honks when approaching a pedestrian. The strategy can be viewed at [30].

E. Mixed Multi-Objective Properties and Compositional Strategy Synthesis
From the sections above it follows that, for stopping stochastic games, we can ε-approximate Pareto sets for any positive Boolean combination of total reward, probabilistic reachability and probabilistic LTL properties. In [50], the authors also discuss a different approach to multi-objective strategy synthesis through composition. First, a composition of stochastic games is defined, in a way that preserves the identity of Player 1. Here, the component stochastic games $G_i$, $i \in I = \{1, \ldots, N\}$, that are being composed are considered to have labels on Player 1 and Player 2 transitions referred to as actions and, in the composed game $G = \parallel_{i \in I} G_i$, component games synchronise on actions. Properties of the component games, as well as the composed game, are then defined over sequences of actions called traces, rather than over paths as in Def. 6. Under the assumption that the component games are compatible, i.e., all actions of Player 1 in each composite game are enabled and fully controlled by Player 1, the Player 1 strategy $\pi = \parallel_{i \in I} \pi_i$ for $G$ that is a composition of Player 1 strategies $\pi_i$ for component games $G_i$ preserves all properties. More precisely, if strategies $\pi_i$ guarantee a (possibly multi-objective) property $\Phi_i$ in component games $G_i$, then the composed strategy $\pi$ guarantees property $\Phi$ in $G$, where $\Phi$ is any property for the composed game that can be derived from $\Phi_i$ using, for example, assume-guarantee rules in [51]. In particular, Player 1 of different component games can cooperate to achieve a common goal: if in one component game Player 1 guarantees a property $\Phi_2$ under some assumption $\Phi_1$ on the environment, i.e., $\Phi_1 \Rightarrow \Phi_2$, and Player 1 in a different component game ensures $\Phi_1$, then the composition satisfies property $\Phi_2$.
The framework for compositional strategy synthesis presented in [50] first computes an ε-approximation $Q$ of the Pareto set for $\Phi$ based on ε-approximations $Q_i$ of Pareto sets for $\Phi_i$. For a chosen achievable vector of bounds $(p, x)$ for $\Phi$, Player 1 strategies $\pi_i$ are synthesised for component games $G_i$ that achieve $\Phi_i(p_i, x_i)$, where $(p_i, x_i)$ are the bounds obtained by projecting $(p, x)$ from $Q$ to $Q_i$. The composed strategy $\pi = \parallel_{i \in I} \pi_i$ then achieves $\Phi(p, x)$. Note that in order to take full advantage of assume-guarantee rules, we would need to be able to synthesise strategies for arbitrary Boolean combinations of properties.

Example 7: An example illustrating the compositional approach to multi-objective game solving can be found in [52]. Here, we present results obtained using reductions to total reward properties as discussed earlier in this section. We combine various properties for the autonomous car from Ex. 1 in a conjunction with probability bounds $p_1$, $p_2$ and reward bound $x_1$. An ε-approximation of the Pareto set computed using the algorithm in Fig. 5 for $\varepsilon = 0.01$ is shown in Fig. 8. For example, for $p_1 = 0.7$, $p_2 = 0.13$, $x_1 = 5.7$, the winning strategy for the property generated by PRISM-games is a stochastic memory update strategy with 775 memory elements that, in a non-trivial way, probabilistically chooses between all three reactions for a traffic jam and between honking and braking for a pedestrian. The strategy can be viewed at [30].

F. Multi-Objective Discounted Reward Properties
To the best of our knowledge, properties that combine multiple discounted reward properties have only been addressed for the subclass of stochastic games with one player and probabilistic states, i.e., MDPs [53]. Note that the reduction from discounted reward to total reward properties discussed in Sec. III-E alters the transition probabilities of the game depending on the discount factor $\beta \in (0, 1)$. It follows that, using this reduction, the iterative algorithm in Fig. 5 can be applied to compute ε-approximations of Pareto sets for any stochastic game with a Boolean combination of discounted reward properties with the same discount factor.

G. Multi-Objective Average Reward Properties
For multi-objective synthesis with multiple average reward properties, we cannot apply the approach presented in Sec. IV-B. The reason is that the algorithm in Fig. 5 approximates the Pareto set in a finite number of iterations by combining the achievable values of successive states. However, infinite-horizon properties such as the expected average reward disregard all transient behaviour. Preliminary results for multi-objective average reward synthesis have been presented in [1], where the authors consider conjunctions of a special case of the (single-objective) expected average reward properties, namely almost sure average reward properties $R^{r}_{=1,\geq x}[S]$, which require that the average reward achieved over a path is at least a given bound with probability 1. Formally, given a game $G$, its state $s \in S$, a reward structure $r$ and Player 1 and Player 2 strategies $\pi$, $\sigma$, respectively, the relation $G, s, \pi, \sigma \models R^{r}_{=1,\geq x}[S]$ holds if and only if $\Pr^{\pi,\sigma}_{G,s}(\{\lambda \in \mathit{Path}_s \mid \mathit{rew}(r, S)(\lambda) \geq x\}) = 1$. Note that this implies $G, s, \pi, \sigma \models R^{r}_{\geq x}[S]$, but the reverse implication is not necessarily true; see [1] for an example.
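The gap between the two properties can be seen on a tiny Markov chain, i.e., a game in which both strategies are already fixed. The sketch below is a hypothetical example of our own, not the one from [1]: from the initial state, the chain moves with probability 1/2 to an absorbing state with reward 2 and with probability 1/2 to an absorbing state with reward 0, so the average reward of a path is determined by the absorbing state it reaches. All names are illustrative.

```python
from fractions import Fraction

# Hypothetical chain: probability of ending in each absorbing state from the
# initial state, and the long-run average reward of paths that end there.
absorb_prob = {"up": Fraction(1, 2), "down": Fraction(1, 2)}
long_run_avg = {"up": Fraction(2), "down": Fraction(0)}


def expected_average(absorb_prob, long_run_avg):
    """Expected average reward: weight each absorbing state's average by its probability."""
    return sum(p * long_run_avg[t] for t, p in absorb_prob.items())


def prob_average_at_least(absorb_prob, long_run_avg, x):
    """Probability mass of paths whose average reward is at least x."""
    return sum(p for t, p in absorb_prob.items() if long_run_avg[t] >= x)


x = Fraction(1)
expected_ok = expected_average(absorb_prob, long_run_avg) >= x            # expectation property
almost_sure_ok = prob_average_at_least(absorb_prob, long_run_avg, x) == 1  # almost sure property
```

Here the expected average reward equals the bound 1, so the expectation property holds, while only half of the probability mass achieves average reward at least 1, so the almost sure property fails.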
The authors of [1] show that synthesis for multi-objective properties of this type reduces to synthesis for multi-objective expected energy properties. Intuitively, given a reward structure $r$ possibly assigning both positive and negative values to states, the expected energy property requires that, for every state $s \in S$ of the game, there exists a bound $x$ such that the expected total reward obtained starting from $s$ in $k$ steps is at least $x$ for all $k \geq 0$. Only finite-memory (possibly stochastic memory update) strategies are considered, and every Player 1 strategy that satisfies the expected energy property also satisfies the almost sure average reward property over the same reward structure with bound 0; the same applies to ε-optimal strategies. As $R^{r}_{=1,\geq x}[S]$ is equivalent to $R^{r-x}_{=1,\geq 0}[S]$, the above property can be adapted for any almost sure average reward property.
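The shift from $r$ to $r - x$ and the role of the lower bound can be illustrated on a Markov chain obtained by fixing both strategies. The following sketch is a hypothetical illustration, not the algorithm of [1]: it computes the expected total reward over $k$ steps for $k$ up to a finite horizon and records the lowest value seen per state. A finite horizon can only suggest, not prove, that a bound exists for all $k$. The chain, rewards and function names are assumptions.

```python
def expected_energy(chain, reward, horizon):
    """Per state, the minimum over k = 0..horizon of the expected k-step total reward."""
    e = {s: 0.0 for s in chain}  # k = 0: no reward collected yet
    lowest = dict(e)
    for _ in range(horizon):
        # e_{k+1}(s) = r(s) + sum over successors t of P(s, t) * e_k(t)
        e = {s: reward[s] + sum(p * e[t] for t, p in chain[s].items()) for s in chain}
        lowest = {s: min(lowest[s], e[s]) for s in chain}
    return lowest


def shift(reward, x):
    """Replace r by r - x, turning a bound of x into a bound of 0."""
    return {s: r - x for s, r in reward.items()}


# Hypothetical two-state cycle with rewards +1 and -1 (average reward 0).
chain = {"a": {"b": 1.0}, "b": {"a": 1.0}}
reward = {"a": 1.0, "b": -1.0}

bound_0 = expected_energy(chain, shift(reward, 0.0), 100)     # stays bounded below
bound_half = expected_energy(chain, shift(reward, 0.5), 100)  # keeps decreasing
```

For the bound 0 the expected energy oscillates and never drops below −1, consistent with an average reward of 0. For the bound 0.5 the shifted rewards lose one unit per cycle, so the minimum keeps falling as the horizon grows, consistent with the bound not being achievable.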
Given a game G, multiple reward structures r (allowing both positive and negative reward values), a vector of bounds x for the almost sure average reward properties and ε > 0, the authors of [1] design an algorithm that terminates with a finite-memory stochastic update ε-optimal strategy if the vector x is achievable. The algorithm uses value iteration to compute ε-optimal strategies for the corresponding multi-objective expected energy property.
Finally, the authors generalise the almost sure average reward property to a ratio reward property R , and this can be straightforwardly extended to multi-objective properties using vectors.

V. IMPLEMENTATION
Software tools for analysis of games include the following. Among the tools that focus on subclasses of stochastic games, QUASY [54] offers synthesis of strategies for MDPs and non-stochastic games with mean-payoff objectives. Methods for expected ratio reward objectives are implemented in [55]. MultiGain [56] solves MDPs with multi-objective mean-payoff properties. PRISM [21] performs verification for MDPs with single- and multi-objective properties, namely probabilistic LTL and expected total reward. MOCHA [57] is a tool for verification and strategy synthesis for non-stochastic games with alternating-time temporal logic (ATL) specifications, as well as for automatic checking of assume-guarantee queries.
For stochastic games, GIST [58] offers support for qualitative verification, i.e., probability 1 or non-zero probability, of stochastic games with ω-regular properties. GAVS+ [59] includes an implementation of value and policy iteration for stochastic games with reachability properties.
Finally, for various extensions of stochastic games briefly discussed in Sec. VII below, EAGLE [60] and PRALINE [61] analyse Nash equilibria for non-stochastic games. Uppaal Stratego [62] performs strategy synthesis for real-time systems against quantitative properties. The TuLiP toolbox [63] provides synthesis for linear (continuous) systems with GR(1) specifications.
In comparison, most of the algorithmic solutions presented in this paper, covering single- and multi-objective as well as compositional strategy synthesis for stochastic games and games with multiple players, have been implemented in the open-source tool PRISM-games [19], [20], which can be downloaded from [64]. PRISM-games can be used to model, verify, solve and simulate stochastic games with complex properties. It has been developed as an extension of the probabilistic model checker PRISM [21] and takes advantage of PRISM's modelling and specification language, as well as the existing user interface and simulator.
The original version, PRISM-games 1.0 [19], allows one to model multi-player stochastic games as introduced in Def. 9 using modules with synchronising actions. The recently released version, PRISM-games 2.0 [20], adds a compositional modelling approach to facilitate the compositional strategy synthesis discussed in Sec. IV-E.
The specification language is based on rPATL [35], a fragment of the branching-time logic in Def. 8 that also allows specification of properties for coalitions of players as discussed in Sec. III-H. In particular, rPATL subsumes single-objective probabilistic reachability, a restricted class of probabilistic LTL properties, and total reward properties with ρ = F * a, and their Boolean combinations. The first version of PRISM-games supports rPATL formulas, numerical queries and the precise-value operators $P_{=p}$, $R^{r}_{=x}$ [46]. The new version adds several single-objective properties, namely total reward properties with ρ = C for stopping games, average reward and ratio properties for a special class of games called controllable multichain games (for details, see [52]), and almost sure average reward and ratio properties. Besides single-objective properties, PRISM-games 2.0 allows multi-objective properties expressed as Boolean combinations of reward properties of the same type, except for the almost sure average and ratio reward properties, for which only conjunctions are supported.
From the implementation point of view, PRISM-games builds on the Java-based engine of PRISM and handles games in an explicit-state fashion. In multi-objective strategy synthesis, a feature introduced in the new version of the tool, the computation relies on the Parma Polyhedra Library [65] for symbolic manipulation of convex sets during ε-approximate computation of Pareto sets.
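To give a feel for the objects being manipulated, the sketch below works with a finite set of achievable bound vectors for objectives with lower bounds and keeps only those that are not dominated componentwise. This is a deliberately simplified illustration in plain Python: it ignores convex combinations of points (which arise from randomised strategies), does not use the Parma Polyhedra Library, and is not how PRISM-games represents Pareto sets. The points and function names are hypothetical.

```python
def dominates(z, y):
    """True if z is at least y in every component and differs from y."""
    return z != y and all(zi >= yi for zi, yi in zip(z, y))


def pareto_front(points):
    """Keep the points that no other point dominates componentwise."""
    return [y for y in points if not any(dominates(z, y) for z in points)]


def achievable(bound, points, eps):
    """True if some point meets every component of the bound up to eps."""
    return any(all(zi >= bi - eps for zi, bi in zip(z, bound)) for z in points)


# Hypothetical achievable pairs (p1, p2) for two probabilistic objectives.
points = [(0.7, 0.1), (0.5, 0.3), (0.4, 0.2), (0.7, 0.05)]
front = sorted(pareto_front(points))
```

In this example `front` contains `(0.5, 0.3)` and `(0.7, 0.1)`; the other two points are dominated. A polyhedra library becomes necessary once the achievable sets are convex polytopes rather than finite point sets.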

VI. CASE STUDIES
Stochastic games have been used to model and analyse various control and networked systems. Here we list a set of examples that have been evaluated using PRISM-games, and offer their intuitive description. As mentioned in Sec. V, tools such as GIST and GAVS+ provide partial support for stochastic games, but have only been used with small, illustrative examples. For more information, we refer the interested reader to the indicated publications and references therein. Experimental evaluation of some of the examples can also be found in [19], [20]. A more exhaustive list of examples is maintained in the publications section of the PRISM-games website [64] and in the database of PRISM and PRISM-games case studies [66].
Microgrid demand-side management [2]: The example models a decentralised energy management algorithm for smart grids. The system consists of a set of households that generate loads of various duration. Each household follows a simple algorithm to execute a load if the current energy cost is below a pre-agreed limit; otherwise it only executes the load with a pre-agreed probability. The energy cost to execute a load for a single time unit is the number of loads currently being executed in the grid. The algorithm is analysed with respect to the expected load per cost unit for a household, formulated as a single-objective total reward property.
Collective decision making for sensor networks [2]: Sensor networks consist of a set of low-power, autonomous devices that often must collaborate to achieve a goal. Here, a set of sensors is considered with the goal to agree on a target with the highest quality using a decentralised decision algorithm. In the algorithm, a sensor can probabilistically change its preferred target either based on its own exploration of available targets or based on communication with other sensors. The proposed decision procedure is analysed with respect to the speed of convergence, formulated as a total reward property, and robustness, i.e., the ability to recover from a bad decision to a good one, formulated as a branching-time property with nested probabilistic operators.
Reputation protocol for user-centric networks [2]: User-centric networks are designed to encourage users to cooperate in sharing resources and services in order to, for example, provide connectivity in a mobile ad-hoc network. The case study presents a general model consisting of providers offering services to requesters. A requester chooses a provider and submits a request. The provider decides whether to accept the request based on a trust level towards the requester that is dependent on his/her reputation across all users in the network. If the request is accepted, the cost of the service is negotiated. After service delivery, the requester chooses whether to pay the cost or not, thus increasing or decreasing his/her trustworthiness, respectively. Using expected total reward properties, the maximum number of unpaid services that the requester can obtain is computed as well as the minimum price at which the requester can buy a particular number of services. Strategy synthesis is used to uncover possibly undesirable optimal behaviour of the requester in the latter case, and an adjustment to the protocol is suggested to improve it.
Futures market investor [66]: An investor in a futures market decides when to invest in shares of a specific company. The decision can be made on the first day of any month, with the payoff collected one month later. The market value of shares changes probabilistically over time within a bounded range and the distribution changes based on the current value. Moreover, the market can temporarily decide to bar the investor from making the investment. The corresponding stochastic game is analysed to compute the maximum expected payoff that the investor can guarantee for various initial share values and the optimal strategies are discussed.
Human-in-the-loop UAV mission planning [3]: An unmanned aerial vehicle (UAV) is performing road network surveillance, reacting to inputs from a human operator. The UAV acts autonomously in fulfilling most of the piloting functions, such as selecting most of the waypoints that comprise the route, and flying the route. The operator primarily performs sensor tasks at waypoints but may also pick a road for the UAV at waypoints. The optimal UAV piloting strategy depends on mission objectives, e.g., safety, reachability, coverage, and operator characteristics, i.e., workload, proficiency, and fatigue. For the stochastic game modelling the situation, the minimum expected time of completing the temporal mission of covering a set of waypoints is computed. Moreover, a multi-objective property is considered to analyse the trade-off between the completion time and the number of visits to restricted operating zones.
Autonomous urban driving [5]: An autonomous car is considered that drives through an urban environment and reacts to hazards such as pedestrians, obstacles, and traffic jams. Note that this case study serves as a motivation for our illustrative example presented in Ex. 1. Here, the car not only decides on the reactions to hazards, but also chooses the roads to take in order to reach a target location. The presence of hazards, as well as the effects of reactions, may differ between roads. Through multi-objective strategy synthesis, strategies with an optimal trade-off between the probability of reaching the target location, the probability of avoiding accidents and the overall quality of roads on the route are identified.
Aircraft power distribution [1]: An aircraft electrical power network is considered, where power is to be routed from generators to buses through controllable switches. The generators can exhibit failures and switches have delays. The system consists of several components, each containing buses and generators, and the components can deliver power to each other. The network is modelled as a composition of stochastic games, one for each component. Compositional strategy synthesis is applied to find strategies with a good trade-off between uptime of buses and failure rate. The property is modelled as a conjunction of ratio reward properties.
Self-adaptive software architectures [67], [68], [69], [70]: Software systems dealing with distributed applications in changing environments normally require human supervision to continue operation in all conditions. Self-adaptive software architecture is a response to these demands, where the system automatically adapts its structure and behaviour according to changes in real time. Both single- and multi-objective verification of multi-player stochastic games is applied to analyse three self-adaptive software architectures, namely, the impact of communication topology for collections of fully cooperative systems defending against an external attack, the infrastructure for a news website, and an adaptive industrial middleware used to monitor and manage sensor networks in renewable energy production plants.
DNS Bandwidth Amplification Attack [9]: The Domain Name System (DNS) is an Internet-wide hierarchical naming system for assigning IP addresses to domain names, and any disruption of the service can lead to serious consequences. A notable threat to DNS, namely the bandwidth amplification attack, where an attacker attempts to flood a victim DNS server with malicious traffic, is modelled as a stochastic game. Verification and strategy synthesis are used to analyse and generate countermeasures to defend against the attack.

VII. CONCLUSION
In this work, we have overviewed the existing body of knowledge and algorithmic solutions to the verification and strategy synthesis problems for stochastic games. We addressed a large class of properties, from probabilistic linear-time through various expected reward properties, to their branching-time and multi-objective combinations. As demonstrated through the case studies, the techniques can be used to analyse various control systems, for example, in network management, autonomous and human-in-the-loop planning, and security attack countermeasures. Evaluation of such systems can be achieved using the practical implementation of the algorithms in PRISM-games. Though several of the algorithms have high computational complexity, the range of case studies that have been tackled using stochastic games is encouraging, and we anticipate that adapting implementation techniques that have been successful in probabilistic verification, for example symbolic methods and Monte Carlo sampling, will allow us to broaden the applicability even further.
While some of the open questions have already been identified in the previous sections, the following extensions of games pose further challenges.
Concurrent games, where players choose their moves concurrently rather than in turns, comprise the original games with probability introduced by Shapley [11], and they are a natural extension of the stochastic games discussed here. For an overview of existing techniques, see, e.g., [12], [13], [14].
Partial-observation games, where the current state of the game is only partially observed (by one or both of the players), represent another widely studied model of games [17], [71]. Recent results include [72], [73]. Besides concurrent and stochastic games, they subsume models such as partially observable Markov decision processes (POMDPs) [74] and probabilistic automata [75].
Finally, one can consider nonzero-sum games, or games with equilibria, where the objectives of players are not necessarily dual. For an overview of results, see, e.g., [81], [82] and references therein.

Problem 1 (Verification): Given a stochastic game G with an initial state s ∈ S, a set of atomic propositions AP and a labelling function L, and a property over AP from Def. 6, i.e., φ = P p [ψ] or φ = R x [ρ], does it hold that s |= φ?

Problem 2 (Strategy synthesis): Given a stochastic game G with an initial state s ∈ S, a set of atomic propositions AP and a labelling function L, and a property over AP from Def. 6, i.e., φ = P p [ψ] or φ = R x [ρ], construct a Player 1 strategy π ∈ Π (if it exists) that is a witness to the satisfaction s |= φ.

Fig. 3: Optimal values obtained for the step-bounded numerical query $P_{\max=?}[F^{\leq k}\, \mathit{succ}]$ for the game introduced in Ex. 1.

Fig. 4: Value iteration algorithm for the total reward numerical query $R^{r}_{\max=?}[C]$.

Fig. 5: Iterative computation of an ε-approximation of the Pareto set for a multi-objective total reward property. Here, $x \in \mathbb{R}_{\geq 0}$ is a real number, $\mathbf{x} \in \mathbb{R}^{m}_{\geq 0}$ is a vector, $X \subseteq \mathbb{R}^{m}_{\geq 0}$ is a set, and ≤ is the componentwise partial order on $\mathbb{R}^{m}_{\geq 0}$. Given a stopping game G with multiple reward structures r and a multi-objective total reward property Φ(x), the approximation is computed for every state s ∈ S in k = |S| + |S| • ln(ε•(n•M ) −1 )

Fig. 6: An ε-approximation of the Pareto set for the multi-objective total reward property $R^{r_{\mathit{time}}}_{\leq x_1}[C] \wedge R^{r_{\mathit{energy}}}_{\leq x_2}[C]$ for the game introduced in Ex. 1.

Fig. 7: An ε-approximation of the Pareto set for the multi-objective probabilistic LTL property $P_{\geq p_1}[F\, \mathit{succ}] \wedge P_{\geq p_2}[G\, \neg \mathit{acc}]$ for the game introduced in Ex. 1.

TABLE I: Semantics of properties defined in Def. 6. Here, G is a stochastic game, s ∈ S is its state, L is a labelling function over a set of atomic propositions AP, Π and Σ are the sets of Player 1 and Player 2 strategies, respectively, $\lambda = \lambda_0 \lambda_1 \ldots \in \mathit{Path}_G$ is a path, r is a reward structure on G and β ∈ (0, 1) is a discount factor.