A learning-based synthesis approach for reward asynchronous probabilistic games against linear temporal logic winning conditions

The traditional synthesis problem is usually solved by constructing a system that fulfills given specifications. The system constantly interacts with, and plays against, its environment, so the problem can be regarded as solving a two-player game between the system and its environment. Meanwhile, stochastic games are often used to model reactive processes. With the development of the intelligent industry, these theories are extensively used in robot patrolling, intelligent logistics, and intelligent transportation. However, according to the existing research, it is still challenging to find a practically feasible synthesis algorithm and to generate the optimal system. Thus, it is desirable to design an incentive mechanism that motivates the system to fulfill given specifications. This work studies a learning-based approach for strategy synthesis of reward asynchronous probabilistic games against linear temporal logic (LTL) specifications in a probabilistic environment. An asynchronous reward mechanism is proposed to motivate players to maximize the rewards gained from their positions and chosen actions. Based on this mechanism, techniques from learning theory can be applied to transform the synthesis problem into the problem of computing expected rewards. It is then proven that the reinforcement learning algorithm asymptotically provides the optimal strategies that maximize the expected cumulative reward of satisfying an LTL specification. Finally, our techniques are implemented, and their effectiveness is illustrated by two case studies of robot patrolling and autonomous driving.


INTRODUCTION
Reactive system synthesis is a technique that automatically explores how a system satisfies a specific specification (task) (Bloem et al., 2012). In recent years, with the rapid development of the intelligent industry, it has been widely used in robot patrolling, intelligent logistics, and intelligent transportation. However, it is still difficult to find a practically feasible synthesis algorithm and generate the optimal system. Constructing a correct reactive system requires generating outputs for the inputs so as to fulfill a requirement that is typically described by a Linear Temporal Logic (LTL) formula (Buchi & Landweber, 1990). This construction process can be graphically modeled as a two-player game between the system (outputs) and the environment (inputs) (Rabin, 1972). The goal of the game is to synthesize strategies for a player to satisfy a given LTL specification; a win for the system indicates that the specification is satisfied. These games can be solved algorithmically, i.e., one can determine which player wins the game and produce a winning strategy; the winner is guaranteed to have a strategy that is memoryless or requires only finite memory.
Even though the synthesis of LTL specifications has been extensively studied, some challenges remain, mainly because the system may fail to respond in time or may choose the wrong behavior when interacting with a complex, changeable, and uncertain environment. Consequently, with the further study of the system synthesis problem, probability has become the mainstream method to model and analyze reactive systems. Moreover, probability theory has wide applications in the field of control (Ren, Zhang & Zhang, 2019; Zhang & Wang, 2021; Liu, Zhang & Yue, 2021). Recently, probabilistic synthesis has been proposed and extensively studied, e.g., (Filar & Vrieze, 2012; Church, 1963; Shapley, 1953; Chatterjee, Henzinger & Jobstmann, 2008; Kwiatkowska & Parker, 2013; Neyman, Sorin & Sorin, 2003; Nilim & El Ghaoui, 2005; Lustig, Nain & Vardi, 2011; Dräger et al., 2014). With the widespread attention to probabilistic synthesis, there has been growing interest in the problem of expected reward in a probabilistic environment. In such an environment, the value of a strategy for the system is the maximal reward of a play induced by this strategy, and the goal of the system is to maximize this value.
This study focuses on the synthesis problem of reactive systems and transforms it into a probabilistic game with both an LTL winning condition and a reward mechanism. First, a learning-based approach is designed to motivate the system to satisfy the winning condition and to generate strategies. In our work, uncertainty is considered both in the environment properties and in the system behaviors, and reward properties are included in the model. On this basis, this study establishes a probabilistic model with an asynchronous reward mechanism, referred to as a reward asynchronous probabilistic game (RAPG). Then, based on the reinforcement learning method, algorithmic incentive schemes are developed to win the game and generate the corresponding strategies. To synthesize strategies for each player that maximize the expected cumulative reward for satisfying the LTL objectives, this study first encodes reachability properties to obtain sets of states that satisfy specific requirements, then shrinks the state space, and constructs an asynchronous reward mechanism for the players according to the winning condition. Finally, the asynchronous reward mechanism is combined with reinforcement learning to learn strategies that satisfy the given specifications.
To the best of our knowledge, there is no existing work on the strategy synthesis of RAPGs based on reinforcement learning. This study focuses on calculating the expected cumulative reward of a play satisfying the winning condition from each state while maximizing the expected rewards against an adversarial probabilistic environment. The contributions of this article are summarized as follows: (1) an asynchronous reward mechanism is designed to motivate the system to satisfy specific requirements, and a novel probabilistic model is proposed; (2) reachability properties are analyzed and encoded, which tremendously shrinks the state space of the game; (3) a learning-based approach is proposed to compute the maximum expected cumulative reward for satisfying LTL specifications and to generate the corresponding strategies.
The rest of this article is organized as follows. In "Related work", an overview of the related work is given. The background definitions and notations are provided in "Preliminaries". "Synthesis problem through reward mechanism" introduces the definition of RAPGs, the synchronous reward mechanism, and the formalization of the research problem. The learning-based synthesis algorithm is introduced in "Synthesis algorithm based on reinforcement learning". The applicability of our algorithm is verified by using two examples in "Case study". Finally, "Summary and future work" concludes this article and discusses future research work.

RELATED WORK
Reactive system synthesis under an uncertain environment is widely used in computer science, engineering, and economics, e.g., in autonomous driving and robot rescue operations (Sutton & Barto, 2018). Reactive systems have inherent complexity due to continuous interactions with the external environment, and the uncertainty of the environment brings a great challenge to system synthesis. The existing works on reactive synthesis mainly address environmental uncertainty either by reducing the computational complexity of the synthesis algorithm (or optimizing it) (Bloem et al., 2012; Buchi & Landweber, 1990; Harding, Ryan & Schobbens, 2005) or by repairing unrealizable specifications (Hunt & Johnson, 2000; Könighofer, Hofferek & Bloem, 2013; Kuvent, Maoz & Ringert, 2017; Maoz, Ringert & Shalom, 2019). In general, a requirement of a reactive system is defined as a contract between an assumption (input) on the environment behavior and a guarantee (output) on the system behavior: given that the environment behavior satisfies the assumption, the system behavior is guaranteed to satisfy the specified specifications (Zhao et al., 2022). A specification is unrealizable if the environment can satisfy all assumptions while forcing the system behavior to violate some guarantee. Neither efficient synthesis algorithms nor repair of unrealizable specifications directly analyzes the influence of uncertainty on system synthesis. Different from these two methods, our work focuses on motivating the system to satisfy a given requirement through an asynchronous reward mechanism.
As for probabilistic synthesis, most systems are modeled as reachability games, including 2½-player games (Nilim & El Ghaoui, 2005; Svorenová & Kwiatkowska, 2016) and concurrent games (Neyman, Sorin & Sorin, 2003; De, Henzinger & Kupferman, 2007). Nilim & El Ghaoui (2005) proposed modeling a complex system as a 2½-player game; in general, the state space of a 2½-player game comprises a class of player states and a set of probabilistic states. Kwiatkowska, Norman & Parker (2019) introduced turn-based probabilistic timed multi-player games. Concurrent games are also often used to model systems (Hasanbeig et al., 2019; Kwiatkowska et al., 2021); their characteristic is that a state transition is performed through two actions taken by the players separately but independently. The games considered in this article differ from both 2½-player games and concurrent games. In addition, Almagor, Kupferman & Velner (2016) proposed mean-payoff Markov Decision Processes (MDPs) with a parity winning condition to find a strategy that minimizes the expected cost of a play against a probabilistic environment. Our work models a reactive system as an RAPG. In each round of an RAPG, state transitions are determined alternately by the two players, who choose their actions and transitions, and both players obtain corresponding rewards for their choices. Probabilistic reachability is an important property of APGs, which helps in studying the computation of the winning probability of the system. Another important property of APGs considered in our work is expected reachability, which allows rewards and costs to be used in modeling, e.g., rewards for robots completing tasks or safe driving for autonomous cars. Considering the reward property, this work is interested in calculating cumulative rewards, i.e., the sum of the rewards obtained as the system runs.
Based on the calculation of the reward achieved on all runs of the system, the expected cumulative rewards can be obtained. Thus, the value of a strategy for the system is the maximal cumulative reward of a play induced by this strategy, and the goal of the game is to maximize this value by motivating the system to win.
Reinforcement learning (RL) designs algorithms that learn an optimal strategy maximizing (or minimizing) the expected reward through interactions with a complex environment (Sutton & Barto, 2018). Typically, MDPs play a critical role in RL because of their ability to describe the time-independent state-transition property. Generally, there are two types of RL algorithms: model-free and model-based (Filar & Vrieze, 2012; Hasanbeig et al., 2019; Lavaei et al., 2020; Huh & Yang, 2020; Fu & Topcu, 2014; Brázdil et al., 2014; Puterman, 2014). Hasanbeig et al. (2019) present an approach to design optimal control strategies for Markov decision processes with unknown behavior via a model-free RL algorithm; the approach generates traces that satisfy given LTL specifications with maximized probability and returns the maximum expected reward. Many RL algorithms, such as Q-learning, are model-free: by updating the value of each state-action pair, they directly learn the action-value function. Furthermore, the safe-operation problem of a system can be solved by a model-free safety specification algorithm (Huh & Yang, 2020), which learns the maximum probability of safe operation by combining probabilistic reachability with a safe RL algorithm. Model-based reinforcement learning algorithms are also often used to design strategies, as in Fu & Topcu (2014) and Brázdil et al. (2014), which synthesize strategies that maximize the satisfaction probability for MDPs. Most of these studies model the environment as an MDP or an extended MDP. Brázdil et al. (2014) computed the expected total reward for discrete-time Markov chains (DTMCs) by solving a system of equations and for MDPs by solving a linear program (Filar & Vrieze, 2012). In our work, formal methods are combined with learning-based methods to explore the reward properties of probabilistic synthesis.
In particular, for APGs, this work focuses on RL algorithms to compute the expected cumulative reward of the system winning. Precisely, the RL method is used to learn the strategy to motivate the system to win and compute the expected cumulative reward for each player.
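As a point of reference for the model-free algorithms discussed above, a minimal tabular Q-learning loop can be sketched as follows. The toy MDP, function names, and hyperparameters are illustrative, not taken from the article or from the cited works.

```python
import random

def q_learning(states, actions, step, episodes=2000, alpha=0.1, gamma=0.9,
               eps=0.2, seed=0):
    """Tabular Q-learning on a small MDP.

    `step(s, a, rng)` returns (next_state, reward, done).  This is a
    generic textbook sketch of the model-free state-action update, not
    the algorithm proposed in the article.
    """
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s, done = states[0], False
        for _ in range(100):                      # cap episode length
            if done:
                break
            # epsilon-greedy action selection
            if rng.random() < eps:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s2, r, done = step(s, a, rng)
            target = r if done else r + gamma * max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # TD update
            s = s2
    return Q

# A two-state toy MDP: moving 'right' from 's0' reaches the goal (reward 1).
def toy_step(s, a, rng):
    if a == 'right':
        return 'goal', 1.0, True
    return 's0', 0.0, False

Q = q_learning(['s0', 'goal'], ['left', 'right'], toy_step)
```

After training, `Q[('s0', 'right')]` approaches 1 while `Q[('s0', 'left')]` stays near the discounted value 0.9, so the greedy strategy picks `right`.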

PRELIMINARIES
This section briefly introduces the definitions used in this article, including the specification language, games, and a modal µ-calculus over asynchronous probabilistic games. LTL is taken as the specification language, and LTL specifications are introduced first.

LTL
LTL has been increasingly popular as a tool to describe specific requirements when synthesizing strategies for reactive systems.
Syntax. Given a finite, non-empty set V of atomic propositions, an arbitrary proposition in V is denoted p. At a given position, each Boolean variable has a unique truth value, True or False. Note that the temporal operators have two conventional notations, either X, G, F, U or ◯, ◻, ♢, U; this article follows the former.
An LTL formula φ is defined inductively according to the following grammar:

φ ::= True | p | ¬φ | φ_1 ∨ φ_2 | Xφ | φ_1 U φ_2

where the Boolean constants True and False can be denoted by "⊤" and "⊥", respectively; ¬ and ∨ are the logic connectives negation and disjunction; and X and U are the temporal operators next and until. If φ is an LTL formula, ¬φ is also an LTL formula. In addition, the logic connectives conjunction (∧), implication (⇒), and equivalence (⇔) can be defined as φ_1 ∧ φ_2 := ¬(¬φ_1 ∨ ¬φ_2), φ_1 ⇒ φ_2 := ¬φ_1 ∨ φ_2, and φ_1 ⇔ φ_2 := (φ_1 ⇒ φ_2) ∧ (φ_2 ⇒ φ_1), respectively. Additional temporal operators such as eventually (F) and always (G) are derived as Fφ := True U φ and Gφ := ¬F¬φ.
Semantics. Given a finite set of Boolean variables V, an infinite sequence π = π_0 π_1 ··· ∈ (2^V)^ω is called a computation. The satisfaction relation π, i ⊨ φ denotes that the LTL formula φ holds at position i ≥ 0 of π. The semantics of LTL formulas is formally defined as follows:
- π, i ⊨ p if and only if p ∈ π_i;
- π, i ⊨ ¬φ if and only if π, i ⊭ φ;
- π, i ⊨ φ_1 ∨ φ_2 if and only if π, i ⊨ φ_1 or π, i ⊨ φ_2;
- π, i ⊨ Xφ if and only if π, i+1 ⊨ φ;
- π, i ⊨ φ_1 U φ_2 if and only if there exists k ≥ i such that π, k ⊨ φ_2 and, for all i ≤ j < k, π, j ⊨ φ_1.
Intuitively, Xφ means that φ holds (or is true) in the next position (the next "step") of the computation; φ_1 U φ_2 means that φ_1 holds until φ_2 becomes True. The computation π satisfies φ if π, 0 ⊨ φ, denoted π ⊨ φ. If φ holds in every position of the computation, then π satisfies Gφ; if φ will be satisfied at least once in the future, then π satisfies Fφ. Besides, π ⊨ GFφ if φ is true infinitely often in the computation, and π ⊨ FGφ if φ eventually becomes continuously true from some position onward. GFφ is usually used to denote a goal that the system or the environment needs to satisfy.
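The semantics above can be checked mechanically on ultimately periodic computations, i.e., a finite prefix followed by a loop repeated forever. The following sketch is our own illustration (the tuple encoding of formulas is not from the article); it relies on the fact that suffixes of such a computation repeat, so an until-walk may stop as soon as it revisits a position.

```python
def holds(phi, prefix, loop, i=0):
    """Check pi, i |= phi for the ultimately periodic computation
    pi = prefix . loop^omega, where each position is a set of
    propositions.  Formulas are nested tuples: ('true',), ('ap', p),
    ('not', f), ('or', f, g), ('X', f), ('U', f, g)."""
    n, p = len(prefix), len(loop)
    nrm = lambda j: j if j < n else n + (j - n) % p   # fold into prefix+loop
    letter = lambda j: prefix[j] if j < n else loop[j - n]
    op = phi[0]
    if op == 'true':
        return True
    if op == 'ap':
        return phi[1] in letter(nrm(i))
    if op == 'not':
        return not holds(phi[1], prefix, loop, i)
    if op == 'or':
        return holds(phi[1], prefix, loop, i) or holds(phi[2], prefix, loop, i)
    if op == 'X':
        return holds(phi[1], prefix, loop, nrm(i + 1))
    if op == 'U':   # walk forward; suffixes repeat, so stop on a revisit
        j, seen = nrm(i), set()
        while j not in seen:
            seen.add(j)
            if holds(phi[2], prefix, loop, j):
                return True
            if not holds(phi[1], prefix, loop, j):
                return False
            j = nrm(j + 1)
        return False

# Derived operators: F f = True U f, G f = not F not f.
F = lambda f: ('U', ('true',), f)
G = lambda f: ('not', F(('not', f)))
```

For example, on the computation {q}({p})^ω, both Fp and GFp hold, while p does not hold at position 0.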

Asynchronous probabilistic games
The interaction between the system and the environment is transformed into an asynchronous probabilistic game, which helps to analyze the uncertainty in this interaction. The definition of asynchronous probabilistic games is introduced below.
Definition 1. An Asynchronous Probabilistic Game (APG) is defined as a tuple G = ⟨𝒱, Act, V, P_e, P_s, L⟩, where:
- 𝒱 is a finite set of atomic propositions;
- Act is a finite set of actions;
- V = V_e ∪ V_s is the set of states on the game arena, where V_e and V_s are the sets of environment states and system states, respectively;
- P_e : V_e × Act → Dist(V_s) is a transition function of the environment such that P_e(v_e, a)(v_s) is the probability of moving from environment state v_e to system state v_s on taking action a, where Dist(V_s) is a discrete probability distribution over V_s;
- P_s : V_s × Act → Dist(V_e) is a transition function of the system such that P_s(v_s, a)(v_e) is the probability of moving from system state v_s to environment state v_e on taking action a, where Dist(V_e) is a discrete probability distribution over V_e;
- L : V → 2^𝒱 is a labeling function, and L(v) is the set of atomic propositions that hold in v, where v ∈ V.
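Definition 1 can be encoded directly as a data structure; the following sketch is illustrative (the class and field names are ours, not notation from the article), and checks that the distributions are well-formed and that transitions alternate between the two players.

```python
from dataclasses import dataclass

@dataclass
class APG:
    """A minimal encoding of an Asynchronous Probabilistic Game."""
    env_states: set    # V_e
    sys_states: set    # V_s
    actions: set       # Act
    p_env: dict        # P_e: (v_e, a) -> {v_s: probability}
    p_sys: dict        # P_s: (v_s, a) -> {v_e: probability}
    labels: dict       # L:   v -> set of atomic propositions

    def enabled(self, v):
        """The set of actions available in state v."""
        trans = self.p_env if v in self.env_states else self.p_sys
        return {a for (u, a) in trans if u == v}

    def validate(self):
        """Each distribution sums to 1 and alternates between players."""
        for (v, a), dist in list(self.p_env.items()) + list(self.p_sys.items()):
            assert abs(sum(dist.values()) - 1.0) < 1e-9
            targets = self.sys_states if v in self.env_states else self.env_states
            assert set(dist) <= targets   # environment moves to system states and back

# A two-state game alternating between one environment and one system state.
g = APG(env_states={'e0'}, sys_states={'s0'}, actions={'a', 'b'},
        p_env={('e0', 'a'): {'s0': 1.0}},
        p_sys={('s0', 'b'): {'e0': 1.0}},
        labels={'e0': set(), 's0': {'goal'}})
g.validate()
```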
In a game, the steps are executed alternately by the environment and the system. Given an APG G, a finite (or infinite) sequence π = v_0, v_1, ··· of states is called a play of G if, for every i, there is an action a ∈ Act such that P_e(v_i, a)(v_{i+1}) > 0 or P_s(v_i, a)(v_{i+1}) > 0. Let v_0 be the initial state of π and Π be the set of all plays of G. For a state v ∈ V, Π_v is the set of plays with v as the initial state.
Given a game G and an LTL formula φ, this work uses φ to denote the winning condition of the game G. Let π = v_0, v_1, ··· be a play. The play π is winning for the system under a given winning condition φ if either π is a finite play whose last state v_n is an environment state in which there is no action a ∈ Act such that P_e(v_n, a)(v_s) > 0 for some v_s ∈ V_s, or π is an infinite play that satisfies the winning condition φ; otherwise, π is winning for the environment.
For an APG G and a state c ∈ V, an action a ∈ Act is enabled in state c if there is a state d ∈ V such that P_e(c, a)(d) > 0 (resp. P_s(c, a)(d) > 0 when c is a system state). This article denotes the set of actions enabled in c as Act(c).
Assume that Next_s and Next_e are the sets of finite plays whose last state is in V_s and in V_e, respectively.
For an APG G, a strategy for the system of G is a function f : Next_s → Act, where f(v_0 ... v_n) ∈ Act(v_n) is the next action chosen by the system. A strategy for the environment is defined similarly. This article denotes the sets of system strategies and environment strategies of G as F_s and F_e, respectively. A strategy is memoryless if it depends only on the current state of the play and not on the history of the play. Formally, for any π_1 and π_2 in Next_s (or Next_e) and any state v in V_s (or V_e), we have f(π_1 v) = f(π_2 v). In addition to memoryless strategies, there are also history-dependent strategies and finite-memory strategies. In this article, it is sufficient to consider only memoryless strategies (Kwiatkowska & Parker, 2013).
A play π = v_0 v_1 ... v_i ... follows a system (or environment) strategy f if, for each finite prefix σ = v_0 v_1 ... v_i ∈ Next_s (or Next_e), we have P(v_i, f(σ))(v_{i+1}) > 0. Let φ be a state proposition and v a state in V; this article denotes the probability that a play with initial state v satisfies φ while following strategy f as Pr^f(v ⊨ φ).
For a game G, let T ⊆ V be the set of states that satisfy a Boolean expression φ (or, in general, a state proposition), i.e., for every state v ∈ T, φ is true at v. When φ is the winning condition, T is the set of target states of the game G: any play starting from a state in T satisfies the winning condition φ.
Next, this article defines the reachability probability property over the game G. Before giving the definition, the reachability and fairness properties of T are explained. The LTL formula Fφ means that φ holds in some state of the computation; accordingly, the reachability property of T is that some state in T occurs in the computation of the game. The LTL formula GFφ means that φ holds infinitely often in the computation; accordingly, the fairness property of T is that states in T occur infinitely many times in the computation.
Definition 2. Given an APG G, let f be a strategy, φ a winning condition, and T ⊆ V a set of states that satisfy φ. For a state v ∈ V, this article uses v ⊨ FT to denote a play that starts from v, satisfies φ, and reaches some state in T. The probability of such a play is denoted Pr(v ⊨ FT); if the play also follows f, its probability is denoted Pr^f(v ⊨ FT).

A variant of modal µ-calculus over APGs
Most modal/temporal logics can be viewed as sub-logics of the µ-calculus, a powerful extension of which is the modal µ-calculus. This logic has a succinct syntax, and formula variables are often used in it. The semantics of a µ-calculus formula is defined over a Kripke structure, which designates the set of states that satisfy the formula (Kesten, Piterman & Pnueli, 2005). This article defines a variant of the modal µ-calculus over the APG structure. Given an APG G = ⟨𝒱, Act, V, P_e, P_s, L⟩, for every state v ∈ V, the formulas p and ¬p are atomic formulas of G. Let Var = {M, N, ...} be a set of formula variables. The syntax of µ-calculus formulas is defined by the following grammar:

φ ::= p | ¬p | X | φ_1 ∨ φ_2 | φ_1 ∧ φ_2 | ⊛φ | ⊚φ | µX.φ | νX.φ

where X ∈ Var. A formula φ is interpreted as the set of G-states in which φ is true; this set is denoted [[φ]]^e_G, the set of states satisfying φ under e, where e : Var → 2^V is an assignment that maps formula variables to sets of states. The set [[φ]]_G(e) is defined inductively; here only the modal case is recalled: a state m is included in [[⊚φ]]^e_G if, when m is an environment state, it can choose an appropriate action to move into [[φ]]_G(e), and, when m is a system state, any action it chooses reaches a state in [[φ]]_G(e).
In addition, based on the syntax of the modal µ-calculus, the following formulas can be derived. F⊛T = µB.(⊛B ∪ T) is the set with the characteristic that if a state v ∈ V is a system state, then all successors of v are in F⊛T, and if v is an environment state, then there exists an action such that a successor of v is in F⊛T. In particular, if v is in T, it must be in F⊛T, whether v is an environment state or a system state. G⊚T = νB.(⊚B ∩ T) is the set such that if a state v ∈ V is an environment state in T, then all successors of v are in G⊚T, and if v is a system state, then there exists an action such that a successor of v is in G⊚T.
AGEF T = νB.(µY.(⊕Y ∪ T) ∩ ⊗B) is the set such that if v ∈ AGEF T, then any path from v can reach T infinitely many times, where ⊗ is the AX operator of CTL and ⊕ is the EX operator of CTL.
Based on the above l-calculus formulas, the analysis of the state space of the game is presented in "Synthesis algorithm based on reinforcement learning".

SYNTHESIS PROBLEM THROUGH REWARD MECHANISM
In this section, the synthesis problem is defined, to be solved by reinforcement learning that stochastically approximates the value function of a probabilistic game. This article mainly focuses on the expected cumulative rewards of the system winning. The interaction of the system with its environment is first modeled as a reward asynchronous probabilistic game (RAPG), defined as follows.
First, consider an APG G = ⟨𝒱, Act, V, P_e, P_s, L⟩ and a linear temporal logic specification φ. This article uses rewards as an additional quantitative measure on the APG to stimulate the system to win the game. Although some researchers use cost mechanisms to describe minimization (e.g., of energy consumption), reward mechanisms are commonly used for properties that describe maximization (e.g., of profit) (Nilim & El Ghaoui, 2005). In our work, the reward mechanism attaches a reward value to the states and to the actions available in each state in order to motivate the player to win, and the reward accumulates over transitions. Formally, the reward mechanism of an APG is defined as follows:

Definition 3. The reward mechanism for an APG G = ⟨𝒱, Act, V, P_e, P_s, L⟩ is specified by an asynchronous reward structure, a tuple R = (R_st, R_ac) composed of two reward functions: a state reward function R_st : V → ℝ that maps each state v of G to a non-negative real, and an action reward function R_ac : V × Act → ℝ that maps each state-action pair (v, a) of G to a non-negative real, where v ∈ V and a ∈ Act.
The action reward in an asynchronous reward mechanism is also called a transition reward, impulse reward, or state-action reward. By definition, reward functions are Markovian: they map states, or state-action pairs, to a scalar reward value.
Based on an APG and the asynchronous reward mechanism over it, this article models the interaction of the system with its environment as a reward asynchronous probabilistic game (RAPG), which is the seven-tuple G = ⟨𝒱, Act, V, P_e, P_s, L, R⟩. In Zhao et al. (2022), the probabilistic reachability property is defined over APGs; this article discusses the expected reward properties defined over the RAPG model. An RAPG is an extension of an APG and inherits all properties of APGs. The cumulative reward property over an RAPG is defined as follows:

Definition 4. Let G be an RAPG with state space V. For the winning condition φ, let T ⊆ V be a set of states such that all plays starting from any state in T satisfy φ. For a path π = v_0, v_1, ··· of the game G along which actions a_0, a_1, ··· are taken, the cumulative reward for the asynchronous reward structure R along π is defined as

R(π, FT) = Σ_{i=0}^{n-1} [R_st(v_i) + R_ac(v_i, a_i)], if v_i ∉ T for 0 ≤ i < n and v_n ∈ T.

Note that R(π, FT) denotes the cumulative reward earned along the path π until some state in T is reached for the first time. The cumulative reward property is usually used to handle the sum of rewards accumulated from one position (or state) to a specific position (or state). Many other reward-based properties can also be defined over RAPGs, such as the discounted reward and the expected long-run average reward. The characteristic of the discounted reward is that the reward gained at each step is multiplied by a discount factor λ (in general λ < 1), so strategies with fewer steps are generally preferred. Different from the discounted reward, the expected long-run average reward considers the average reward earned per state or transition.
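For a concrete finite play, the sum in Definition 4 can be computed directly. The following sketch is illustrative (the function name and the choice of returning None when the play never enters T are ours; that case is governed by the reachability probability rather than by the sum).

```python
def cumulative_reward(states, actions, R_st, R_ac, T):
    """R(pi, FT): the sum of state and action rewards along the play
    pi = states[0] -a0-> states[1] -a1-> ... until the first state in T.
    `states` has one more element than `actions`.  Returns None if the
    play never reaches T (an illustrative convention)."""
    total = 0.0
    for v, a in zip(states, actions):
        if v in T:                 # T reached before this step: stop summing
            return total
        total += R_st.get(v, 0.0) + R_ac.get((v, a), 0.0)
    return total if states[-1] in T else None

# A three-state play v0 -a-> v1 -b-> v2, where v2 is the target.
r = cumulative_reward(['v0', 'v1', 'v2'], ['a', 'b'],
                      R_st={'v0': 1.0, 'v1': 2.0},
                      R_ac={('v0', 'a'): 0.5, ('v1', 'b'): 0.5},
                      T={'v2'})
```

Here the play accumulates (1.0 + 0.5) at v0 and (2.0 + 0.5) at v1 before first entering T, for a total of 4.0.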
Definition 5. Given an RAPG G with state space V, let f be a strategy, φ a winning condition, T ⊆ V a set such that any play π starting from a state in T satisfies φ, and R = (R_st, R_ac) an asynchronous reward structure. This article denotes the expected cumulative reward of a play that starts from v ∈ V and satisfies φ until reaching some state in T as ER(v ⊨ FT); if the play follows f, its expected cumulative reward is denoted ER^f(v ⊨ FT).
In addition, if Pr(v ⊨ FT) = 0, then ER(v ⊨ FT) = 0. This article is interested in computing the maximum and/or minimum value of the expected cumulative reward of a play. The problem is formally defined as follows:

Definition 6. For an RAPG G with state space V, let f be a strategy, φ a winning condition, T ⊆ V a set such that any play π starting from a state in T satisfies φ, and R = (R_st, R_ac) an asynchronous reward structure. To stimulate the system to win, synthesize a strategy f that maximizes (or minimizes) the expected cumulative reward ER(v ⊨ FT) of the system (or environment) while satisfying φ.

SYNTHESIS ALGORITHM BASED ON REINFORCEMENT LEARNING
This section discusses the learning-based synthesis algorithm for the RAPG G. To compute the expected cumulative reward, this article first analyzes how to divide the state space of the RAPG G. Then, an efficient incremental probabilistic synthesis algorithm based on reinforcement learning is proposed.

Qualitative reachability
Consider an RAPG G, the winning condition φ, and a set of states T ⊆ V such that all plays starting from any state in T satisfy φ. This article first uses µ-calculus formulas to obtain a state set G⊚¬T, from which all plays fail to satisfy φ. Specifically, for v ∈ G⊚¬T: (a) if v is an environment state, there is at least one successor of v not in T; (b) if v is a system state, none of the successors of v are in T. If v ∈ G⊚¬T, then Pr(v ⊨ FT) = 0; in particular, the expected cumulative reward of all plays starting from a state in G⊚¬T is 0, i.e., ER(v ⊨ FT) = 0. The set G⊚¬T can be computed by using Algorithm 1. By analyzing the state space in this way, computing abstractions of the whole state space can be effectively avoided.
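A greatest-fixpoint computation in the spirit of Algorithm 1 (which is not reproduced in this excerpt) can be sketched as follows. The reading of conditions (a) and (b) here is an interpretation, not the article's exact text: an environment state needs some action whose successors all stay in the set, while a system state must stay in it under every action.

```python
def g_not_T(states, env_states, succ, T):
    """Sketch: compute a set B of states outside T from which the play
    can be kept out of T forever, so Pr(v |= FT) = 0.  `succ[v]` maps
    each action enabled in v to the set of possible successors."""
    B = set(states) - set(T)
    while True:
        keep = set()
        for v in B:
            branches = list(succ.get(v, {}).values())
            if v in env_states:
                # some action keeps all probabilistic successors inside B
                ok = any(all(u in B for u in dests) for dests in branches)
            else:
                # every action must keep all successors inside B
                ok = branches and all(all(u in B for u in dests)
                                      for dests in branches)
            if ok:
                keep.add(v)
        if keep == B:       # fixpoint reached
            return B
        B = keep

# e0 can cycle through s0 forever (action a) and thereby avoid target t,
# whereas s1 is forced into t.
succ = {'e0': {'a': {'s0'}, 'b': {'s1'}},
        's0': {'c': {'e0'}},
        's1': {'c': {'t'}}}
Z = g_not_T({'e0', 's0', 's1', 't'}, {'e0'}, succ, {'t'})
```

In the example, s1 drops out in the first iteration because its only action leads to t, while e0 and s0 survive every iteration.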
To compute the expected cumulative reward, sure-reachability plays starting from a state v ∈ V need no special treatment, because such a play still accumulates rewards on the way to T, unless v ∈ T.

Synthesis through reinforcement learning
The goal of synthesis is to incentivize the system to satisfy the LTL winning condition through the reward mechanism. This goal can be achieved by computing the maximum expected cumulative reward for each state, and some standard techniques from the reinforcement learning literature can be adopted to find satisfying strategies. The calculation method is introduced in detail below.

Theorem 1. Consider an RAPG G with state space V = V_e ∪ V_s, the winning condition φ, a set of states T ⊆ V such that all plays starting from a state in T satisfy φ, and an asynchronous reward structure R = (R_st, R_ac). Let x^k_v denote the maximizing value of the expected cumulative reward under strategy f in state v, where k ≥ 0 is the iteration parameter; for convenience, define x^k_v := ER^f(v ⊨ FT). States in G⊚¬T need not be considered, and x^0_v = 0 for all v ∈ V. The value x^k_v of each state is then computed iteratively: if v ∈ V_s \ T, then

x^k_v = max_{a ∈ Act(v)} Σ_{v' ∈ V} P_s(v, a)(v') [R_st(v) + R_ac(v, a) + x^{k-1}_{v'}],

and if v ∈ V_e \ T, then

x^k_v = min_{a ∈ Act(v)} Σ_{v' ∈ V} P_e(v, a)(v') [R_st(v) + R_ac(v, a) + x^{k-1}_{v'}].

Algorithm 2 below is a learning-based algorithm that computes the expected cumulative reward for each state and extracts strategies. The algorithm has good scalability.
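The iterative update of Theorem 1 can be sketched as a value iteration; the encoding of the transition function and reward dictionaries below is our own illustration, with a fixed iteration count standing in for a convergence test.

```python
def expected_rewards(states, sys_states, T, zero, P, R_st, R_ac, iters=100):
    """Iterate the Theorem 1 update: x_v = 0 on T and on `zero` (the set
    of states from which T is unreachable); elsewhere the system
    maximizes and the adversarial environment minimizes the expected
    cumulative reward accumulated before reaching T.  P maps (v, a) to
    {v': probability}."""
    x = {v: 0.0 for v in states}
    for _ in range(iters):
        new = {}
        for v in states:
            if v in T or v in zero:
                new[v] = 0.0
                continue
            acts = [a for (u, a) in P if u == v]
            vals = [R_st.get(v, 0.0) + R_ac.get((v, a), 0.0)
                    + sum(p * x[u] for u, p in P[(v, a)].items())
                    for a in acts]
            new[v] = max(vals) if v in sys_states else min(vals)
        x = new
    return x

# System state v0 reaches the target t (reward 2 + 1); environment state
# e0 picks the cheaper of two actions (reward 5 vs. reward 1).
P = {('v0', 'a'): {'t': 1.0},
     ('e0', 'a'): {'t': 1.0},
     ('e0', 'b'): {'t': 1.0}}
x = expected_rewards(states={'v0', 'e0', 't'}, sys_states={'v0'},
                     T={'t'}, zero=set(), P=P,
                     R_st={'v0': 2.0},
                     R_ac={('v0', 'a'): 1.0, ('e0', 'a'): 5.0,
                           ('e0', 'b'): 1.0})
```

The system state converges to 3.0 (it maximizes), while the adversarial environment state converges to 1.0 (it minimizes over its two actions).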

CASE STUDY
In this section, two case studies are presented to illustrate our synthesis method. One is the problem of robot patrolling in a certain area, and the other is the problem of safety reachability of unmanned cars.

Robot patrolling
As shown in Fig. 1, the robot performs a patrol task in an area divided into four regions. The patrol route starts at region 1 and goes through regions 2 and 3 to region 4. In this scenario, if the robot encounters a person in region 2 or region 3, it stays in that region with the person. If it encounters an unknown hazardous item, the robot picks it up and takes it to region 4. Furthermore, it is assumed that a hazardous item and a person do not appear at the same time: if a hazardous item appears first, no person appears while it is being delivered. Even if a second hazardous item appears, the robot delivers the first one to region 4 before processing the next task.
The game graph corresponding to Fig. 1 is illustrated in Fig. 2. Consider an RAPG G = ⟨𝒱, Act, V, P_e, P_s, L, R⟩, where V_e = {v_0, v_2, v_4, v_6} is the set of environment states, V_s = {v_1, v_3, v_5, v_7} is the set of system states, Act = {a, b, c, d, e} is the set of actions, and R = (R_st, R_ac) is the asynchronous reward structure. At an environment state, the environment can choose to place an item or to let a person appear in order to affect the robot patrol; these two actions are denoted a and b, respectively. If the item is hazardous, the robot will bring it to region 4. At a system state, the system can choose to pick up or to stay; these two actions are denoted c and d, respectively. In this case, identifying the item as hazardous and picking it up are considered the same action.
Algorithm 2: Learning-based algorithm for RAPG G
1: input: an RAPG G with finite state space V, a state set T ⊆ V, and G⊚¬T
2: ∀v ∈ V: x_v^0 = 0
3: for k = 1, 2, ...
4:   if v ∈ T ∪ G⊚¬T
5:     x_v^k = 0
6:   else
7:     ∀v ∈ V \ T: update x_v^k by the value iteration of Theorem 1

According to this result, the expected cumulative reward for each state to reach T can be obtained. For example, ER(v₀ ⊨ F T) = 4.80, i.e., from state v₀, the expected cumulative reward for the robot (the system) to take the hazardous item to region 4 is at least 4.80. Meanwhile, the environment always selects action b, the strategy that forces the system to the minimum expected cumulative reward.
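Once the values x_v have converged, memoryless strategies can be read off directly: the system picks an argmax action at each of its states and the environment an argmin action. The sketch below illustrates this extraction on a hypothetical toy game (the encoding and numbers are not those of the robot game in Fig. 2).

```python
# Extract memoryless strategies from converged values x: the system ('max'
# player) picks an argmax action, the environment ('min' player) an argmin.
# Each action maps to (reward, list of (probability, successor) pairs).
def extract_strategies(game, x):
    strategy = {}
    for v, (kind, actions) in game.items():
        if not actions:
            continue                      # terminal states need no choice

        def q(a, actions=actions):
            r, succs = actions[a]
            return r + sum(p * x[s] for p, s in succs)

        pick = max if kind == 'max' else min
        strategy[v] = pick(actions, key=q)
    return strategy

# Toy game and its (already converged) values: x_v1 = 3.0, x_v0 = min(4, 5) = 4.0.
game = {
    'v0': ('min', {'a': (1.0, [(1.0, 'v1')]),
                   'b': (2.0, [(1.0, 'v1')])}),
    'v1': ('max', {'c': (3.0, [(1.0, 'goal')]),
                   'd': (1.0, [(1.0, 'goal')])}),
    'goal': ('max', {}),
}
x = {'goal': 0.0, 'v1': 3.0, 'v0': 4.0}
print(extract_strategies(game, x))        # system picks 'c' at v1, environment 'a' at v0
```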

Safety of autonomous driving
Consider the reward asynchronous probabilistic game G depicted in Fig. 3, which can be regarded as a simplified model of an unmanned car. The game models an unmanned car driving on an urban road along a given navigation route. There are three intersections A, B, and C on the road. While following the route, the car must react autonomously and quickly to dangers. Here, two dangers are considered: a pedestrian and a traffic jam. To avoid a traffic jam, the car changes lane or honks; to avoid a pedestrian, the car brakes or honks. It is assumed that if the car passes the first two intersections safely, it is safe to pass intersection C.
The game graph corresponding to Fig. 3 is shown in Fig. 4. Consider an RAPG G = ⟨V, Act, P_e, P_s, L, R⟩, where V_e = {v₀, v₂, v₄, v₆, v₈} is the set of environment states, V_s = {v₁, v₃, v₅, v₇} is the set of system states, Act = {a, b, c, d, e, f, g} is the action set, and R = (R_st, R_ac) is the asynchronous reward structure. At an environment state, the environment can choose to cause a traffic jam or have a pedestrian appear to prevent the car from reaching intersection C safely; these two actions are denoted a and b, respectively, and when the road is normal, the environment takes action e. At a system state, the system can choose to brake, honk, or change lane; these three actions are denoted c, d, and f, respectively, and when the car is running normally, the system takes action g. The process of the game is as follows. Starting from state v₀, the car runs normally toward intersection A. If the environment takes action a, a traffic jam occurs with probability 0.7 and does not occur with probability 0.3. If the environment takes action b, a pedestrian appears with probability 0.8 and does not appear with probability 0.2. If the car (the system) takes action c, it avoids the traffic jam with probability 0.95 and fails to avoid it with probability 0.05. If it takes action f, it avoids the traffic jam with probability 0.8 and fails with probability 0.2. If it takes action d, it avoids the pedestrian with probability 0.8 and fails with probability 0.2. When the road is normal, the car drives normally with probability 1; in this case, it is assumed that the system takes action g and reaches the state determined by the environment with probability 0.5. Also, it is assumed that if the car safely reaches intersections A and B, it can safely reach intersection C.
The following is the specific game diagram of the environment and the system, where v₂ and v₈ are the accident states with the label acc, and v₇ is the state in which intersection C is reached safely. So the set T = {v₇} satisfies the winning condition w, and G⊚¬T = {v₂, v₈} is obtained by Algorithm 1.
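The outcome probabilities listed in the prose can be collected in a small table and sanity-checked. The snippet below encodes only the per-action distributions given in the text; the pairing of these actions with the concrete edges of Fig. 4 is not reproduced here.

```python
# Outcome distributions for the actions described in the text. Each entry maps
# an action to its outcome probabilities; the numbers come from the prose, but
# the full transition structure of Fig. 4 is not encoded.
outcomes = {
    'a': {'traffic_jam': 0.7, 'no_jam': 0.3},        # environment: cause a jam
    'b': {'pedestrian': 0.8, 'no_pedestrian': 0.2},  # environment: pedestrian
    'c': {'avoided': 0.95, 'not_avoided': 0.05},     # system: brake
    'd': {'avoided': 0.8, 'not_avoided': 0.2},       # system: honk
    'f': {'avoided': 0.8, 'not_avoided': 0.2},       # system: change lane
}

# Every action's outcome distribution must sum to 1.
for act, dist in outcomes.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9

# Best single-action avoidance probability against each danger, per the text:
# against a traffic jam the system can use c (0.95) or f (0.8); against a
# pedestrian it uses d (0.8).
best_avoidance = {
    'traffic_jam': max(outcomes['c']['avoided'], outcomes['f']['avoided']),
    'pedestrian': outcomes['d']['avoided'],
}
print(best_avoidance)
```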
Full-size  DOI: 10.7717/peerj-cs.1094/ fig-4 k¼13:x 13 v 0 ¼6:10; x 15 v 8 ¼0 According to this result, the expected cumulative reward to reach the set T can be obtained. For example, ERðv 0 FTÞ ¼ 6:10, i.e., in the state v 0 , the expected cumulative reward of the car (the system) to safely go through an intersection C is at least 6.10. Meanwhile, in this state, a strategy that enables the car to reach the set T can be derived, i.e., the environment takes action b.
In addition, during the experiments, it is found that the smaller the convergence threshold h is, the larger the iteration parameter k becomes, and the more accurate the computed values x_v^k are.
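This trade-off can be illustrated on any contracting value iteration. The toy update below is hypothetical (it is not the RAPG iteration itself), but it exhibits the same behavior: shrinking the stopping threshold increases the iteration count k and tightens the error of the final value.

```python
# Toy contraction x <- 0.5*x + 1 with fixed point 2.0: iterate until the
# update magnitude falls below the threshold, recording the iteration count
# and the final distance from the true fixed point.
def iterate_until(theta):
    x, k = 0.0, 0
    while True:
        nx = 0.5 * x + 1.0
        k += 1
        if abs(nx - x) < theta:
            return k, abs(nx - 2.0)
        x = nx

for theta in (1e-2, 1e-4, 1e-6):
    k, err = iterate_until(theta)
    print(f"theta={theta:g}: k={k}, error={err:.2e}")
```

A smaller theta strictly increases k and strictly decreases the final error, matching the experimental observation above.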

SUMMARY AND FUTURE WORK
This article studies reactive system synthesis problems and proposes a probabilistic model, called reward asynchronous probabilistic games (RAPGs), for computing rewards in dynamic environments. In our model, players are motivated to choose actions through a reward mechanism, where a play generates rewards whose values depend on both state rewards and action rewards. The RAPGs, a subclass of asynchronous probabilistic games, are equipped with an LTL winning condition and can fully describe the probabilistic behavior of both the system and the environment. A synthesis algorithm is presented to compute the expected cumulative rewards; in addition, symbolic synthesis algorithms are provided for RAPGs to compute the maximum expected cumulative reward for satisfying the winning condition and to synthesize the corresponding strategies. Our algorithm is formulated as a value iteration based on reinforcement learning and works as follows. First, reachability properties are encoded as μ-calculus formulas. From these properties, a set of states is obtained from which every play fails to satisfy the winning condition, and it is shown that the expected cumulative reward of any play starting from a state in this set is 0. This step is clear, simple, and convenient; more importantly, it shrinks the state space over which the expected cumulative reward must be computed. Then, an asynchronous reward mechanism is defined based on the winning condition of the RAPGs to incentivize the system to win. Based on this reward construction procedure, a reinforcement learning algorithm is introduced to synthesize the optimal strategies that achieve the maximum expected cumulative reward for the system to win.
One interesting topic for further study is synthesizing strategies for mean-payoff rewards under the GR(1) winning condition. Another is improving the scalability of the synthesis techniques to handle large and complex models. Meanwhile, we will extend the class of supported specifications and develop tools for automatic probabilistic synthesis.