Strategy Synthesis for Zero-Sum Neuro-Symbolic Concurrent Stochastic Games

Neuro-symbolic approaches to artificial intelligence, which combine neural networks with classical symbolic techniques, are growing in prominence, necessitating formal approaches to reason about their correctness. We propose a novel modelling formalism called neuro-symbolic concurrent stochastic games (NS-CSGs), which comprise two probabilistic finite-state agents interacting in a shared continuous-state environment. Each agent observes the environment using a neural perception mechanism, which converts inputs such as images into symbolic percepts, and makes decisions symbolically. We focus on the class of NS-CSGs with Borel state spaces and prove the existence and measurability of the value function for zero-sum discounted cumulative rewards under piecewise-constant restrictions on the components of this class of models. To compute values and synthesise strategies, we present, for the first time, practical value iteration (VI) and policy iteration (PI) algorithms to solve this new subclass of continuous-state CSGs. These require a finite decomposition of the environment induced by the neural perception mechanisms of the agents and rely on finite abstract representations of value functions and strategies closed under VI or PI. First, we introduce a Borel measurable piecewise-constant (B-PWC) representation of value functions, extend minimax backups to this representation and propose a value iteration algorithm called B-PWC VI. Second, we introduce two novel representations for the value functions and strategies, constant-piecewise-linear (CON-PWL) and constant-piecewise-constant (CON-PWC) respectively, and propose Minimax-action-free PI by extending a recent PI method based on alternating player choices for finite state spaces to Borel state spaces, which does not require normal-form games to be solved.


Introduction
Game theory offers an attractive framework for analysing strategic interactions among agents in machine learning, with application to, for instance, the game of Go [1], autonomous driving [2] and robotics [3]. An important class of dynamic games is stochastic games [4], which move between states according to transition probabilities controlled jointly by multiple agents (also called players). Extending both strategic-form games to dynamic environments and Markov decision processes (MDPs) to multiple players, stochastic games have long been used to model sequential decision-making problems with more than one agent, ranging from multi-agent reinforcement learning [5], to quantitative verification and synthesis for equilibria [6].
Recent years have witnessed encouraging advances in the use of neural networks (NNs) to approximate either value functions or strategies [7] for stochastic games that model large, complex environments. Such end-to-end NNs directly map environment states to Q-values or actions. This means that they have a relatively complex structure and a large number of weights and biases, since they interweave multiple tasks (e.g., object detection and recognition, decision making) within a single NN. An emerging trend in autonomous and robotic systems is neuro-symbolic approaches, where some components that are synthesised from data (e.g., perception modules) are implemented as NNs, while others (e.g., nonlinear controllers) are formulated using traditional symbolic methods. This can greatly simplify the design and training process, and yield smaller NNs.
Even with the above advances, there remains a lack of modelling and verification frameworks which can reason formally about the correctness of neuro-symbolic systems. Progress has been made on techniques for both multi-agent verification [8,9] and safe reinforcement learning [10] in this context, but without the ability to reason formally about stochasticity, which is crucial for modelling uncertainty. Elsewhere, concurrent stochastic games (CSGs) have been widely studied [11,12,13,14,15], and also integrated into formal modelling and verification frameworks [6], but primarily in the context of finite state spaces, which are insufficient for many real-life systems.
We propose a new modelling formalism called neuro-symbolic concurrent stochastic games (NS-CSGs), which comprise two finite-state agents endowed with perception mechanisms implemented via NN classifiers and conventional, symbolic decision-making mechanisms. NN perception mechanisms assume real-valued inputs, which naturally result in continuous-state spaces that are partitioned according to the observations made by the NNs. Under the assumption that agents have full state observability and working with Borel state spaces, we establish restrictions on the modelling formalism which ensure that the NS-CSGs belong to a class of uncountable state-space CSGs [16] that are determined for zero-sum discounted cumulative objectives, and therefore prove the existence and measurability of the value function for such objectives.
Next, we propose a new Borel measurable piecewise-constant (B-PWC) representation for the value function and show its closure under the minimax operator. Using this (finite) representation, we develop an implementable B-PWC VI algorithm for NS-CSGs that approximates the value of the game and prove the algorithm's convergence.
Then, we present a Minimax-action-free PI algorithm for NS-CSGs inspired by recent work for finite state spaces [17], which we generalise by using novel representations for the value functions and strategies, constant-piecewise-linear (CON-PWL) and constant-piecewise-constant (CON-PWC), to ensure finite representability and measurability. This allows us to overcome the main issue that arises when solving Borel state space CSGs with PI, namely that the value function may change from a Borel measurable function to a non-Borel measurable function across iterations.
The PI algorithm adopts the alternating player choices proposed in [17] and removes the need to solve normal-form games and MDPs at each iteration. To the best of our knowledge, these are the first implementable algorithms for solving zero-sum CSGs over Borel state spaces with convergence guarantees. Finally, we illustrate our approach by modelling a dynamic vehicle parking problem as an NS-CSG and synthesising (approximately optimal) strategies using a prototype implementation of our B-PWC VI algorithm.
We note that we assume a fully observable game setting. While it is relatively straightforward to generalise NS-CSGs with partial observability, since NS-CSGs already include perception functions that generate observations, there are no general algorithmic methods for value and strategy computation in the partially observable game setting. We believe that an approach similar to [18,19], which converts imperfect-information games to perfect-information, can potentially be used to enable the solution of partially observable NS-CSGs.

Related work
Stochastic games were introduced by Shapley [4], who assumed a finite state space. Since then, many researchers have considered CSGs with uncountable state spaces, e.g., [16,20,21]. Maitra and Parthasarathy [20] were the first to study discounted zero-sum CSGs in this setting, assuming that the state space is a compact metric space. Following this, more general results for discounted zero-sum CSGs with Borel state spaces have been derived, e.g., [16,22,21,23]. These aim at providing sufficient conditions for the existence of either values or optimal strategies for the players.
Another important and practical problem for zero-sum CSGs with uncountable state spaces is the computation of values and optimal strategies. Since the seminal policy iteration (PI) methods were introduced by Hoffman and Karp [24] and Pollatschek and Avi-Itzhak [25], a wide range of fixed-point algorithms have been developed for zero-sum CSGs with finite state spaces [11,12,13,14]. Recent work by Bertsekas [17] proposed a distributed optimistic abstract PI algorithm, which inherits the attractive structure of the Pollatschek and Avi-Itzhak algorithm while resolving its convergence difficulties. Value iteration (VI) and PI algorithms have also been improved for simple stochastic games [26,27]. However, all of the above approaches assume finite state spaces and, to the best of our knowledge, there are no existing VI or PI algorithms for CSGs with uncountable, or more specifically Borel, state spaces. VI and PI algorithms for stochastic control (i.e., the one-player case) with Borel state spaces can be found in [28,29]. Other problems studied for zero-sum CSGs with uncountable state spaces include information structures [30], specialised strategy spaces [31], continuous-time settings [32] and payoff criteria [23].
A variety of other objectives, for instance, mean-payoff [33,34], ratio [34] and reachability [35,36] objectives, have also been studied for CSGs [11,12,13,14]. But these are primarily in the context of finite/countable state spaces which, as argued above, are insufficient for our setting where uncountable real vector spaces are usually supplied as inputs to NNs. We remark that, building on an earlier version of this work [37], there has been recent progress on solving NS-CSGs [38], but focusing on finite-horizon objectives and using equilibria-based (nonzero-sum) properties.
Finally, we note that this paper assumes a fully observable game setting; a natural extension would be partially observable stochastic games (POSGs), for which there are no general VI and PI computation algorithms. A variant of POSGs, called factored-observation stochastic games (FOSGs), was recently proposed [19] that distinguishes between private and public observations in a similar fashion to our model, but for finite-state models without NNs. Partial observability in FOSGs is dealt with via a mechanism that converts imperfect-information games into continuous-state (public belief state) perfect-information games [18,19], such that many techniques for perfect-information games can also be applied. Our fully observable model can arguably serve as a vehicle to later solve the more complex case with imperfect information.

Background
In this section we summarise the background notation, definitions and concepts used in this paper.

Borel measurable spaces and functions
Given a non-empty set X, we denote its Borel σ-algebra by B(X), and the sets in B(X) are called Borel sets of X. The pair (X, B(X)) is a (standard) Borel space if there exists a metric on X that makes it a complete separable metric space (unless required for clarity, B(X) will be omitted). For convenience we will work with real vector spaces; however, this is not essential and any complete separable metric spaces could be used. For Borel spaces X and Y, a function f : X → Y is Borel measurable if the pre-image f^{-1}(B) is a Borel set of X for every Borel set B of Y. We denote by F(X) the space of all bounded, Borel measurable real-valued functions on a Borel space X, with respect to the unweighted sup-norm ∥J∥ = sup_{x∈X} |J(x)| for J ∈ F(X). For functions J, K ∈ F(X), we use max[J, K] and min[J, K] to denote the respective pointwise maximum and minimum functions of J and K, i.e., we have opt[J, K](x) := opt{J(x), K(x)} for opt ∈ {min, max} and x ∈ X.
We now introduce notation and definitions for concepts that are fundamental to the abstraction on which our algorithms are performed. The abstraction is based on a decomposition of the uncountable state space into finitely many abstract regions. In the definitions below, let X ⊆ R^{n_1} and Y ⊆ R^{n_2} for n_1, n_2 ∈ N.
Definition 1 (FCP and Borel FCP). A finite connected partition (FCP) of X, denoted Φ, is a finite collection of disjoint connected subsets (regions) that cover X. Furthermore, Φ is a Borel FCP (BFCP) if each region ϕ ∈ Φ is a Borel set of X.
For BFCPs Φ_1 and Φ_2 of X, we denote by Φ_1 + Φ_2 the smallest BFCP of X that is a refinement of both Φ_1 and Φ_2, which can be obtained by taking all the intersections between regions of Φ_1 and Φ_2.
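As a small illustration of the refinement operation Φ_1 + Φ_2, the following sketch computes all pairwise intersections of regions. Purely for illustration it assumes regions are axis-aligned boxes; in general regions may be arbitrary Borel sets (e.g., polytopes, as used later for ReLU perception functions).

```python
from itertools import product

# Regions are represented here as axis-aligned boxes ((xlo, xhi), (ylo, yhi));
# a partition is a list of pairwise-disjoint boxes covering the space.

def box_intersection(b1, b2):
    """Intersect two axis-aligned boxes; return None if the overlap is empty."""
    dims = []
    for (lo1, hi1), (lo2, hi2) in zip(b1, b2):
        lo, hi = max(lo1, lo2), min(hi1, hi2)
        if lo >= hi:          # empty (or degenerate) overlap
            return None
        dims.append((lo, hi))
    return tuple(dims)

def refine(phi1, phi2):
    """Phi_1 + Phi_2: all non-empty pairwise intersections of regions."""
    return [r for r in (box_intersection(a, b) for a, b in product(phi1, phi2)) if r]

# Example: two partitions of [0,4] x [0,4]
phi1 = [((0, 2), (0, 4)), ((2, 4), (0, 4))]   # split on x = 2
phi2 = [((0, 4), (0, 1)), ((0, 4), (1, 4))]   # split on y = 1
print(refine(phi1, phi2))                     # four rectangular regions
```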

Probability measures
Let X be a Borel space. A function f : B(X) → [0, 1] is a probability measure on X if f(X) = 1 and f(∪_{i∈I} B_i) = Σ_{i∈I} f(B_i) for any countable disjoint family of Borel sets (B_i)_{i∈I}. We denote the space of all probability measures on a Borel space X by P(X). For Borel spaces X and Y, a Borel measurable function σ : Y → P(X) is called a stochastic kernel on X given Y (also known as a transition probability function from Y to X), and we denote by P(X | Y) the set of all stochastic kernels on X given Y. If σ ∈ P(X | Y), y ∈ Y and B ∈ B(X), then we write σ(B | y) for σ(y)(B). It follows that σ ∈ P(X | Y) if and only if σ(· | y) ∈ P(X) for all y ∈ Y and σ(B | ·) is Borel measurable for all B ∈ B(X).

Neural networks
A neural network (NN) is a real vector-valued function f : R^m → R^c obtained by composing a sequence of layer functions h_i(x_i) = act_i(W_i x_i + b_i), where x_i is the input to the ith layer, given by the output h_{i−1}(x_{i−1}) of the (i−1)th layer, act_i is an activation function, and W_i x_i + b_i is a weighted sum of x_i for a weight matrix W_i and a bias vector b_i. An NN f is continuous for all popular activation functions, e.g., Rectified Linear Unit (ReLU), Sigmoid and Softmax [39]. An NN f is said to be a classifier for a set of classes C of size c if, for any input x ∈ R^m, the output f(x) ∈ R^c is a probability vector where the ith element of f(x) represents the confidence probability of the ith class of C, i.e., a classifier is a function f : R^m → P(C).
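The following sketch illustrates such an NN classifier f : R^2 → P(C) with one hidden ReLU layer and a softmax output, broadly matching the structure used in the examples later in the paper; the weights are random placeholders rather than a trained perception network.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(10, 2)), np.zeros(10)    # hidden layer, 10 ReLU neurons
W2, b2 = rng.normal(size=(16, 10)), np.zeros(16)   # output layer, 16 classes

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def f(x):
    """Confidence probabilities over the 16 classes for an input x in R^2."""
    h1 = relu(W1 @ x + b1)
    return softmax(W2 @ h1 + b2)

p = f(np.array([3.1, 1.7]))
print(p.argmax(), p.sum())    # most likely class and total probability (= 1)
```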

Concurrent stochastic games
Finally, in this section, we recall the model of two-player concurrent stochastic games. A CSG is a tuple G = (N, S, A, ∆, δ) where:
• N = {1, 2} is a set of two players;
• S is a finite set of states;
• A = (A_1 ∪ {⊥}) × (A_2 ∪ {⊥}), where A_i is a finite set of actions available to player i ∈ N and ⊥ is an idle action disjoint from the set A_1 ∪ A_2;
• ∆ : S → 2^{A_1 ∪ A_2} is an action available function;
• δ : (S × A) → P(S) is a probabilistic transition function.
In a state s of a CSG G, each player i ∈ N selects an action from its available actions, i.e., from the set ∆(s) ∩ A_i, if this set is non-empty, and selects the idle action ⊥ otherwise. We denote the action choices for each player i in state s by A_i(s), i.e., A_i(s) equals ∆(s) ∩ A_i if ∆(s) ∩ A_i ≠ ∅ and equals {⊥} otherwise, and by A(s) the possible joint actions in a state, i.e., A(s) = A_1(s) × A_2(s). Supposing each player i chooses action a_i, then with probability δ(s, (a_1, a_2))(s′) there is a transition to state s′ ∈ S. A path π of G is a sequence π = s_0 --α_0--> s_1 --α_1--> ··· such that s_k ∈ S, α_k ∈ A(s_k) and δ(s_k, α_k)(s_{k+1}) > 0 for all k ≥ 0. We let FPaths_G and IPaths_G denote the sets of finite and infinite paths of G, respectively. For a path π, we denote by π(k) the (k+1)th state, and by π[k] the action for the transition from π(k) to π(k+1).
A strategy for a player of a CSG G resolves its action choices in each state. These choices can depend on the history of the CSG's execution and can be randomised. Formally, a strategy for player i is a function σ i : FPaths G → P(A i ∪{⊥}) mapping finite paths to distributions over available actions, such that, if σ i (π)(a i )>0, then a i ∈ A i (last(π)) where last(π) is the final state of π. A strategy is said to be stationary if it makes the same choices for paths that end in the same state. Furthermore, a strategy profile of G is a pair σ = (σ 1 , σ 2 ) of strategies for each player. Given a strategy profile σ and state s, letting IPaths σ s denote the set of infinite paths from s under the choices of σ, we can define a probability measure Prob σ s ∈ P(IPaths σ s ) [40].

Zero-sum neuro-symbolic concurrent stochastic games
This section introduces our model of neuro-symbolic concurrent stochastic games (NS-CSGs). We restrict attention to two-agent (which we also refer to as two-player) games as we are concerned with zero-sum games in which there are two agents with directly opposing objectives. However, the approach extends to multi-agent games, by allowing the agents to form two coalitions with directly opposing objectives. A (two-agent) NS-CSG comprises two interacting neuro-symbolic agents acting in a shared, continuous-state environment. Each agent has finitely many local states and actions, and is endowed with a perception mechanism implemented as an NN through which it can observe the state of the environment, storing the observations locally in percepts.

Definition 6. A (two-agent) neuro-symbolic concurrent stochastic game (NS-CSG) C comprises agents (Ag_i)_{i∈N} for N = {1, 2} and environment E, where each agent Ag_i = (S_i, A_i, ∆_i, obs_i, δ_i), the environment E = (S_E, δ_E), and we have:
• S_i = Loc_i × Per_i is a set of states for Ag_i, and Loc_i ⊆ R^{b_i} and Per_i ⊆ R^{d_i} for b_i, d_i ∈ N are finite sets of local states and percepts, respectively;
• S_E ⊆ R^e for e ∈ N is a closed infinite set of environment states;
• A_i is a nonempty finite set of actions for Ag_i, and A = (A_1 ∪ {⊥}) × (A_2 ∪ {⊥}) is the set of joint actions, where ⊥ is an idle action;
• ∆_i : S_i → 2^{A_i} is an available action function for Ag_i, defining the actions the agent can take in each of its states;
• obs_i : (Loc_1 × Loc_2 × S_E) → Per_i is a perception function for Ag_i, mapping the local states of the agents and the environment state to a percept of the agent, implemented via an NN classifier for the set Per_i;
• δ_i : (S_i × A) → P(Loc_i) is a probabilistic transition function for Ag_i, determining the distribution over the agent's local states given its current state and joint action;
• δ_E : (S_E × A) → S_E is a deterministic transition function for the environment, determining its next state given its current state and joint action.
In an NS-CSG C the agents and environment execute concurrently and agents move between their local states probabilistically. For simplicity, we consider deterministic environments, but all the results extend directly to probabilistic environments with finite branching. A (global) state of an NS-CSG comprises a state s_i = (loc_i, per_i) for each agent Ag_i (a local-state-percept pair) and an environment state s_E. A state s = ((loc_1, per_1), (loc_2, per_2), s_E) is percept compatible if per_i = obs_i(loc_1, loc_2, s_E) for 1 ≤ i ≤ 2. In state s = (s_1, s_2, s_E), each Ag_i simultaneously chooses one of the actions available in its state s_i (if no action is available, i.e., ∆_i(s_i) = ∅, then Ag_i chooses the idle action ⊥), resulting in a joint action α = (a_1, a_2) ∈ A. Next, each Ag_i updates its local state to some loc′_i ∈ Loc_i, according to the distribution δ_i(s_i, α). At the same time, the environment updates its state to some s′_E ∈ S_E according to the transition δ_E(s_E, α). Finally, each Ag_i, based on its new local state, observes the new local state of the other agent and the new environment state to generate a new percept per′_i = obs_i(loc′_1, loc′_2, s′_E). Thus, the game reaches the percept compatible state s′ = ((loc′_1, per′_1), (loc′_2, per′_2), s′_E).

Example 1. As an illustration, we present an NS-CSG model of a dynamic vehicle parking problem (a static version is presented in [41]). Fig. 1 shows two agents, Ag_1 (the red vehicle) and Ag_2 (the blue vehicle), in a (continuous) environment R = {(x, y) ∈ R^2 | 0 ≤ x, y ≤ 4} and two parking spots ps_1, ps_2 ∈ R (the green circles), which are known to the agents. The perception function of the agents uses an NN classifier f : R → Per, where Per = {1, 2, 3, 4}^2 (we assume N ⊆ R), which takes the coordinates of a vehicle or parking spot as input and outputs one of the 16 abstract grid points, thus partitioning the environment, see Fig. 1 (centre).
The actions of the agents are to move either up, down, left or right, or park. The vehicles of the agents start from different positions in R and have the same speed. The red agent initially chooses one parking spot and changes its parking spot with probability 0.5 when the blue agent is observed to be closer to its chosen parking spot and both agents move towards this spot, see Fig. 1 (centre and right). Formally, the agents and the environment are defined as follows.
• Loc 1 = {ps 1 , ps 2 } and Loc 2 = {⊥}, i.e., the local state of Ag 1 is its current chosen parking spot and the local state of Ag 2 is a dummy state. For 1 ≤ i ≤ 2, the set of percepts of Ag i is given by Per i = Per × Per , representing the abstract grid points that each agent perceives as the positions of the two vehicles.
• S E = R × R, i.e., the environment is in state s E = (w 1 , w 2 ) if w i is the continuous coordinate of Ag i 's vehicle for 1 ≤ i ≤ 2.
• For any 1 ≤ i ≤ 2, loc_i ∈ Loc_i and per_i = (p_1, p_2) ∈ Per_i, the set of available actions ∆_i((loc_i, per_i)) is equal to A_i if p_i ∈ {f(ps_1), f(ps_2)} and equal to A_i \ {park} otherwise, i.e., an agent's available actions are to move up, down, left and right, and additionally park when the agent is perceived to have reached a parking spot (a sketch of this function is given below).
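The following is a minimal sketch of this available-action function; the encoding of percepts as pairs of grid points, and the names delta_available and spot_percepts, are illustrative assumptions rather than part of the model.

```python
ACTIONS = ["up", "down", "left", "right", "park"]

def delta_available(i, loc_i, per_i, spot_percepts):
    """per_i = (p1, p2): perceived grid points of the two vehicles;
    'park' is available only when Ag_i is perceived to be on a parking spot."""
    p_own = per_i[i - 1]                       # agent i's own perceived position
    if p_own in spot_percepts:                 # spot_percepts = {f(ps1), f(ps2)}
        return list(ACTIONS)
    return [a for a in ACTIONS if a != "park"]

# e.g., with parking spots perceived at grid points (1, 1) and (4, 4):
print(delta_available(1, "ps1", ((1, 1), (3, 2)), {(1, 1), (4, 4)}))  # includes 'park'
```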

Semantics of an NS-CSG
The semantics of an NS-CSG C is a CSG over the product of the states of the agents and the environment: its state space is the set of percept compatible states, its available actions combine those of the two agents, and its transition function composes the agents' probabilistic local updates δ_1 and δ_2 with the deterministic environment update δ_E, regenerating percepts via obs_1 and obs_2.
Notice that the CSG C is over percept compatible states and that, by definition of obs i for each agent Ag i , the underlying transition relation δ is closed with respect to percept compatible states. Since δ E is deterministic and Loc i is a finite set, the set of successors of s under α, denoted Θ α s = {s ′ | δ(s, α)(s ′ ) > 0}, is finite for all s ∈ S and α ∈ A(s). While the semantics of an NS-CSG is an instance of the general class of uncountable state space CSGs, its particular structure induced by perception functions (see Definition 6) will be important in order to establish measurability and finite representability to allow us to derive our algorithms.
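To make the execution model concrete, the sketch below performs one joint step of the induced CSG, composing the agents' probabilistic local updates with the deterministic environment update and regenerating percepts; the arguments delta_1, delta_2, delta_E, obs_1 and obs_2 are placeholders for the model's components δ_i, δ_E and obs_i, with distributions given as dictionaries.

```python
import random

def sample(dist, rng=random):
    """Draw from a finitely supported distribution given as {outcome: probability}."""
    r, acc = rng.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r < acc:
            return outcome
    return outcome

def step(state, alpha, delta_1, delta_2, delta_E, obs_1, obs_2, rng=random):
    (loc1, per1), (loc2, per2), sE = state
    # agents update their local states probabilistically, the environment deterministically
    loc1p = sample(delta_1((loc1, per1), alpha), rng)
    loc2p = sample(delta_2((loc2, per2), alpha), rng)
    sEp = delta_E(sE, alpha)
    # percepts are regenerated from the new local states and new environment state
    per1p = obs_1(loc1p, loc2p, sEp)
    per2p = obs_2(loc1p, loc2p, sEp)
    return ((loc1p, per1p), (loc2p, per2p), sEp)   # a percept compatible state
```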

Zero-sum NS-CSGs
For an NS-CSG C, the objectives we consider are discounted accumulated rewards, and we assume the first agent tries to maximise the expected value of this objective and the second tries to minimise it. More precisely, for a reward structure r = (r_A, r_S), where r_A : (S × A) → R is an action reward function and r_S : S → R is a state reward function, and discount factor β ∈ (0, 1), the accumulated discounted reward for a path π of C over the infinite horizon is defined by:

Y(π) := Σ_{k=0}^∞ β^k ( r_A(π(k), π[k]) + r_S(π(k)) ) .    (1)

Example 2. Returning to the dynamic vehicle parking model of Example 1, we suppose the objective for Ag_1 is to try to park at its current parking spot without crashing into Ag_2 and, since we consider zero-sum NS-CSGs whose objectives must be directly opposing, the objective of Ag_2 is to try to crash into Ag_1 and prevent it from parking. We can represent this scenario using a discounted reward structure, where all action rewards are zero and for the state rewards we set: there is a negative reward if it is perceived that Ag_1 has yet to reach its current parking spot and the agents have crashed; a positive reward if it is observed that Ag_1 has reached its parking spot, which is higher if the agents are not perceived to have crashed; and 0 otherwise.
The state reward function is defined accordingly, based on the agents' percepts, and for the discount factor we let β = 0.6. ■
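As a small illustration, the sketch below evaluates the discounted accumulated reward (1) over a finite path prefix; the reward functions and path are illustrative placeholders, not the model's actual reward structure.

```python
def discounted_reward(path, r_A, r_S, beta):
    """path = [(s0, a0), (s1, a1), ...]; returns sum_k beta^k (r_A(s_k, a_k) + r_S(s_k))."""
    return sum((beta ** k) * (r_A(s, a) + r_S(s)) for k, (s, a) in enumerate(path))

# e.g., zero action rewards, +10 for a 'parked' state, -5 for a 'crash' state:
r_A = lambda s, a: 0.0
r_S = lambda s: 10.0 if s == "parked" else (-5.0 if s == "crash" else 0.0)
path = [("driving", "right"), ("driving", "up"), ("parked", "park")]
print(discounted_reward(path, r_A, r_S, beta=0.6))   # 0 + 0 + 0.36 * 10 = 3.6
```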

Strategies of NS-CSGs
Since the state space S is uncountable due to the continuous environment state space, we follow the approach of [16] and require Borel measurable conditions on the choices that the strategies can make to ensure the measurability of the induced sets of paths.
The semantics of any NS-CSG will turn out to be an instance of the class of CSGs from [16], for which stationary strategies achieve optimal values [16, Theorem 2(ii), Theorem 3], and therefore, to simplify the presentation, we restrict our attention to stationary strategies and refer to them simply as strategies. Before we give their formal definition, since we work with real vector spaces we require the following lemma.
Proof. By Theorem 27 [42, Chapter 9.6] and Theorem 12 [42, Chapter 9.4], S_1, S_2 and S_E are complete separable metric spaces, and hence are Borel spaces. Furthermore, S_1 × S_2 × S_E is the Cartesian product of Borel spaces, and therefore, using Theorem 1.10 [43, Chapter 1], is also a Borel space. Since we assume obs_i is Borel measurable for 1 ≤ i ≤ 2 (see Assumption 1), for (loc_i, per_i) ∈ S_i and 1 ≤ i ≤ 2, the set {((loc_1, per_1), (loc_2, per_2), s_E) ∈ S_1 × S_2 × S_E | per_i = obs_i(loc_1, loc_2, s_E)} is a Borel set, and hence the set S of percept compatible states, being the intersection of these sets for i = 1, 2, is also a Borel set. □

A (stationary) strategy for Ag_i is a Borel measurable function σ_i : S → P(A_i ∪ {⊥}) such that, if σ_i(s)(a_i) > 0, then a_i ∈ A_i(s), and a (strategy) profile σ = (σ_1, σ_2) is a pair of strategies for each agent. We denote by Σ_i the set of all strategies of Ag_i and by Σ = Σ_1 × Σ_2 the set of profiles.

Assumptions on NS-CSGs
Finally, in this section we list the assumptions over NS-CSGs that are required for the results presented in the remainder of the paper. First, NS-CSGs are designed to model neuro-symbolic agents, whose operation depends on particular perception functions, which may result in imperfect information. However, we assume full observability, i.e., where agents' decisions can depend on the full state space. It is straightforward to extend the semantics above to partially observable CSGs (POSGs) [44,45] where, for any state, each agent's observation function returns the agent's observable component of the state, by restricting to observationally-equivalent strategies, but this comes at a significant increase in complexity. Instead, we focus on full observability, which can serve as a vehicle to solve the more complex imperfect information game via an appropriate adaptation of the belief-space construction.
Regarding the structure of NS-CSGs, we make the following assumptions to ensure determinacy and that our finite abstract representations of value functions and strategies are closed under VI and PI.

Assumption 1. For any NS-CSG C and reward structure r = (r_A, r_S):
(i) for each joint action α ∈ A, the environment transition function δ_E(·, α) : S_E → S_E is bimeasurable and BFCP invertible;
(ii) for each agent Ag_i and local states (loc_1, loc_2) ∈ Loc_1 × Loc_2, the perception function obs_i(loc_1, loc_2, ·) : S_E → Per_i is B-PWC;
(iii) for each joint action α ∈ A, the reward functions r_A(·, α) : S → R and r_S : S → R are B-PWC and bounded.

For simplicity, in our formalisation we assume that percepts (and local states) of agents are drawn from finite sets of real-valued vectors; however, any finite sets could be used.
The above assumptions for NS-CSGs differ from existing stochastic games with Borel state spaces [16,22,23] in that the states have both discrete and continuous elements, while the perception and reward functions are required to be B-PWC. The B-PWC requirements in Assumption 1(ii) and (iii) and BFCP invertibility in Assumption 1(i) are needed to achieve B-PWC closure, and hence ensure finitely many abstract state regions (and are used in Lemmas 2, 3, 4 and Theorem 2 below). Bimeasurability in Assumption 1(i) ensures the existence of the value of an NS-CSG with respect to a reward structure (and is used in Proposition 1).
In the case that the perception function obs_i of each agent Ag_i is implemented via an NN classifier f_i : R^m → P(Per_i) (see Section 2.3), we have that, since f_i is continuous, it is also Borel measurable. However, to ensure that the corresponding perception function obs_i satisfies Assumption 1(ii), we need to consider situations where the class with the highest probability returned by f_i is not unique. To resolve such cases we use a tie-breaking rule defined by a function κ_i : 2^{Per_i} → Per_i which, given a set of percepts, i.e., those with the highest probability, returns the selected percept. Then requiring κ_i to be a Borel measurable function is sufficient for Assumption 1(ii) to hold.
Example 3. Returning to Example 1, we now give two potential observation functions for the agents meeting the above assumptions.
The first is via a linear regression model for multi-class classification, with decision boundaries given by the grid lines x = k and y = k for k ∈ {1, 2, 3}, where the environment boundaries are excluded as they do not split two different classes. The Borel measurable tie-breaking rule used here assigns boundary points to the left and lower discrete coordinate, e.g., the class of environment state (2, 3, 3.1, 1.7) is (2, 3, 4, 2).
The second is implemented via the product of a feed-forward NN classifier f : R 2 → P(Per ) with itself, i.e., f × f , where Per is the set of 16 abstract grid points, see Fig. 1 (right). This NN f has one hidden ReLU layer with 10 neurons. We break ties using a total order over the abstract grid points, which is Borel measurable. ■.
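A minimal sketch of the first observation function is given below; the exact placement of the decision boundaries and the clamping at the environment boundary are assumptions chosen to be consistent with the worked example above.

```python
import math

def grid_class(coord):
    """Map a continuous coordinate in [0,4]^2 to an abstract grid point in {1,2,3,4}^2."""
    def axis_class(v):
        c = math.ceil(v)            # boundary points (integers) go to the lower class
        return min(max(c, 1), 4)    # clamp so the environment boundary is not a split
    x, y = coord
    return (axis_class(x), axis_class(y))

def obs(env_state):
    """Percept of an agent: perceived grid points of both vehicles."""
    w1, w2 = env_state
    return (grid_class(w1), grid_class(w2))

print(obs(((2.0, 3.0), (3.1, 1.7))))   # ((2, 3), (4, 2)), matching the example
```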
We discuss the case when perception functions are implemented using ReLU networks in more detail in Section 6, but we remark that Assumption 1(ii) allows a wider range of observation functions than just NNs for implementing perception mechanisms.

Game structures for NS-CSGs
In this section, we present three finite representations for the continuous state space of an NS-CSG. These take the form of BFCPs with respect to the perception, reward and transition functions of the NS-CSG. Recall, from Section 2, that a BFCP of a set is a finite family of disjoint Borel sets (regions) that cover the set. Using Assumption 1, we construct these BFCPs over the state space such that the states in each region are equivalent with respect to either the perception, reward or transition function, e.g., for any region of the perception BFCP all states in the region yield the same percept. These BFCPs allow us to abstract an uncountable state space into a finite set of regions when performing our VI and PI algorithms. In particular, Sections 6 and 7 demonstrate how these different BFCPs can be used, together with intersection, image and pre-image operations, to iteratively refine the abstract representations of the environment while maintaining the necessary conditions for correctness and convergence of value functions.
For the remainder of this section we fix an NS-CSG C and reward structure r.
Lemma 2 (Perception BFCP). There exists a smallest BFCP of S, called the perception BFCP, denoted Φ_P, such that, for any ϕ ∈ Φ_P, all states in ϕ have the same agents' states, i.e., if (s_1, s_2, s_E), (s′_1, s′_2, s′_E) ∈ ϕ, then s_1 = s′_1 and s_2 = s′_2.

Proof. For 1 ≤ i ≤ 2, since obs_i is PWC and S_i is finite, using Definition 6 we have that, for any pair of agent states (s_1, s_2) ∈ S_1 × S_2 with s_i = (loc_i, per_i), the set {s_E ∈ S_E | obs_i(loc_1, loc_2, s_E) = per_i for 1 ≤ i ≤ 2} can be expressed as a number of disjoint regions of S_E, and we let Φ^{s_1,s_2}_E be such a representation that minimises the number of regions. It then follows that Φ_P := {{(s_1, s_2)} × ϕ_E | (s_1, s_2) ∈ S_1 × S_2 and ϕ_E ∈ Φ^{s_1,s_2}_E} is a smallest FCP of S such that all states in any region have the same agents' states. Next we prove that Φ_P is a BFCP of S. We consider a region ϕ ∈ Φ_P. Thus all states in ϕ have the same agents' states, say s_1 = (loc_1, per_1) and s_2 = (loc_2, per_2). According to Assumption 1, obs_i(loc_1, loc_2, ·) : S_E → Per_i for 1 ≤ i ≤ 2 is B-PWC. The pre-image of (per_1, per_2) under obs_1 and obs_2 over S given s_1 = (loc_1, per_1) and s_2 = (loc_2, per_2), denoted obs^{-1}(per_1, per_2 | s_1, s_2), equals the intersection of the pre-images of per_1 under obs_1(loc_1, loc_2, ·) and of per_2 under obs_2(loc_1, loc_2, ·), and therefore is a Borel set of S. Since Φ_P is the smallest such partition of S, the regions in Φ_P which lead to the percept (per_1, per_2) given s_1 and s_2 have no common boundary. Thus, obs^{-1}(per_1, per_2 | s_1, s_2) is a finite union of disjoint regions in Φ_P which include the agents' states s_1 and s_2. Thus, each such region is a Borel set of S, meaning that ϕ ∈ B(S). Thus, Φ_P is a BFCP of S. □

Lemma 3 (Reward BFCP). For each α ∈ A, there exists a smallest BFCP of S, called the reward BFCP of S under α and denoted Φ^α_R, such that for any ϕ ∈ Φ^α_R all states in ϕ have the same state reward and action reward when α is chosen, i.e., if s, s′ ∈ ϕ, then r_S(s) = r_S(s′) and r_A(s, α) = r_A(s′, α).

Proof. Since r_S and r_A(·, α) are B-PWC by Assumption 1, such a smallest partition exists, and Φ^α_R is a BFCP of S by a similar argument to that in the proof of Lemma 2. □

Using Assumption 1, we show that, given any joint action α, the perception BFCP Φ_P can be refined into a new BFCP, such that the states in each region of this BFCP all reach, under the transition function of C, the same regions of the image of Φ_P under the transition function. This result will be used for the existence of the value of C and in our algorithms.
It remains to consider the case when δ(s, α) is defined. Considering any ϕ′ ∈ Φ_P, by Lemma 2 there exist agent states shared by all states in ϕ′. We have the following two cases.
Finally, we divide ϕ into the BFCP Σ_{ϕ′∈Φ_P} Φ′(ϕ, ϕ′), and therefore each region of this BFCP has the required reachability consistency. □

Example 4. Returning to Example 1, we now give the perception BFCPs for the observation functions proposed in Example 3. In each case the perception BFCP is of the form Φ_P = Loc_1 × Loc_2 × Φ_E, where Φ_E is a BFCP for the environment state space, and the perception BFCP is also the reward BFCP Φ^α_R for α ∈ A.
For the first observation function, which uses a linear regression model, the BFCP Φ_E for the environment state space is given by the products of the 4 × 4 grid cells induced by the decision boundaries for the two vehicles' coordinates, as shown in Fig. 2 (left), where the Borel measurable tie-breaking rule is used for the boundary points. For the second observation function, the BFCP Φ_E can be found by computing the pre-images of each feed-forward NN classifier [46], and is shown in Fig. 2 (right). ■

Values of zero-sum NS-CSGs
We now proceed by establishing the value of an NS-CSG C with respect to an objective Y, defined by (1) for a reward structure r and discount factor β. We prove the existence of this value, which is a fixed point of a minimax operator. Using Banach's fixed-point theorem, a sequence of bounded, Borel measurable functions converging to this value is constructed.
Given a state s and (strategy) profile σ = (σ_1, σ_2) of C, we denote by E^σ_s[Y] the expected value of the objective Y when starting from state s, given by (1). The functions V̲, V̄ : S → R, where for s ∈ S:

V̲(s) := sup_{σ_1∈Σ_1} inf_{σ_2∈Σ_2} E^{(σ_1,σ_2)}_s[Y]  and  V̄(s) := inf_{σ_2∈Σ_2} sup_{σ_1∈Σ_1} E^{(σ_1,σ_2)}_s[Y]

are called the lower value and upper value of Y, respectively.
Definition 9 (Value function). If V̲(s) = V̄(s) for all s ∈ S, then C is determined with respect to the objective Y and the common function is called the value of C, denoted by V⋆, with respect to Y.
We next introduce the spaces of feasible state-action pairs and state-action-distribution tuples, and present properties of these spaces. More precisely, for 1 ≤ i ≤ 2, we let Ξ_i = {(s, a_i) | s ∈ S and a_i ∈ A_i(s)} and Λ_i = {(s, u_i) | s ∈ S and u_i ∈ P(A_i(s))}, and let Ξ_12 = {(s, (a_1, a_2)) | s ∈ S and (a_1, a_2) ∈ A(s)} and Λ_12 = {(s, (u_1, u_2)) | s ∈ S and u_i ∈ P(A_i(s)) for 1 ≤ i ≤ 2}.

Lemma 5 (Borel sets). For 1 ≤ i ≤ 2, the sets Ξ_i and Λ_i are Borel sets of S × A_i and S × P(A_i), respectively. Furthermore, the sets Ξ_12 and Λ_12 are Borel sets of S × (A_1 × A_2) and S × (P(A_1) × P(A_2)), respectively.
Proof. We first consider Ξ_i and Λ_i for i = 1 (the case for i = 2 follows similarly). Since A_1 is finite, the sets Ξ_1 and Λ_1 can be rearranged as finite unions, over the subsets Â_1 ⊆ A_1, of the products of {s ∈ S | A_1(s) = Â_1} with Â_1 and with P(Â_1), respectively. Since Â_1 is a subset of the finite set A_1, the sets Â_1 and P(Â_1) are Borel sets of A_1 and P(A_1), respectively. Since S_1 is a finite set, for any Â_1 ⊆ A_1, the set {s_1 | ∆_1(s_1) = Â_1} is a Borel set of S_1. Since S_2 and S_E are both Borel sets by Lemma 1, the result follows by Theorem 1.10 [43, Chapter 1]. Using similar reasoning, it follows that Ξ_12 and Λ_12 are also Borel sets of the respective spaces. □

Proposition 1 (Stochastic kernel transition function). The probabilistic transition function δ of C is a stochastic kernel.
Proof. From Definition 7, it follows that, for any (s, α) ∈ Ξ_12, we have δ(s, α)(·) ∈ P(S). We show that, if B ∈ B(S), then δ(·, ·)(B) : (S × A) → R is Borel measurable on Ξ_12. More precisely, we prove that, for any c ∈ R, the pre-image of the Borel set [c, ∞) of R under δ(·, ·)(B), given by {(s, α) ∈ Ξ_12 | δ(s, α)(B) ≥ c}, is a Borel set. If c ≤ 0, this pre-image equals Ξ_12, which is a Borel set by Lemma 5. Therefore, it remains to consider the case when 0 < c ≤ 1. Consider any α ∈ A and let Φ^α_P be the refinement of Φ_P of Lemma 4. For each ϕ ∈ Φ^α_P and ϕ′ ∈ Φ_P such that p_α(ϕ, ϕ′) > 0, let q_α : ϕ → ϕ′ be the associated bimeasurable, BFCP invertible function from Lemma 4. The image of ϕ under q_α into ϕ′ is given by q̂_α(ϕ, ϕ′) = {q_α(s) | s ∈ ϕ}. By Lemmas 2 and 4, both ϕ and ϕ′ are Borel sets and q_α is bimeasurable, and therefore q̂_α(ϕ, ϕ′) is a Borel set. Next, since q_α is Borel measurable, the pre-image of the Borel set q̂_α(ϕ, ϕ′) ∩ B under q_α over the region ϕ, given by q^{-1}_α(ϕ, q̂_α(ϕ, ϕ′) ∩ B) = {s ∈ ϕ | q_α(s) ∈ q̂_α(ϕ, ϕ′) ∩ B}, is a Borel set. By combining this result with Lemma 4, each state in q^{-1}_α(ϕ, q̂_α(ϕ, ϕ′) ∩ B) under α transitions to B with probability p_α(ϕ, ϕ′). We denote the set of all transition probabilities from ϕ under α by P_α(ϕ) = {p_α(ϕ, ϕ′) > 0 | ϕ′ ∈ Φ_P}. Then, the collection of the subsets of P_α(ϕ) for which the sum of their elements is greater or equal to c is given by P^{≥c}_α(ϕ) = {P′ ⊆ P_α(ϕ) | Σ_{p∈P′} p ≥ c} and is finite. Now for each set P′ ∈ P^{≥c}_α(ϕ), the states in the set O_α(ϕ, P′), defined as the intersection of the sets q^{-1}_α(ϕ, q̂_α(ϕ, ϕ′) ∩ B) over the regions ϕ′ ∈ Φ_P with p_α(ϕ, ϕ′) ∈ P′, transition to B under α with probability greater or equal to c, and O_α(ϕ, P′) is a Borel set as P′ is a finite set. Thus, the states in ϕ reaching B under α with probability greater or equal to c are given by the union of the sets O_α(ϕ, P′) over P′ ∈ P^{≥c}_α(ϕ), which is therefore a Borel set. Taking the (finite) union over the regions ϕ ∈ Φ^α_P and joint actions α ∈ A shows that the pre-image of [c, ∞) under δ(·, ·)(B) is a Borel set, completing the proof. □

The minimax operator T : F(S) → F(S) is defined, for V ∈ F(S) and s ∈ S, by:

[T V](s) := max_{u_1∈P(A_1(s))} min_{u_2∈P(A_2(s))} Σ_{(a_1,a_2)∈A(s)} u_1(a_1) u_2(a_2) Q(s, (a_1, a_2), V)

where for any α ∈ A(s): Q(s, α, V) := r_A(s, α) + r_S(s) + β Σ_{s′∈Θ^α_s} δ(s, α)(s′) V(s′).

Theorem 1 (Value function). If C is an NS-CSG and Y is a discounted zero-sum objective, then
(i) C is determined with respect to Y, i.e., V⋆ exists; (ii) V⋆ is the unique fixed point of the operator T; (iii) V⋆ is a bounded, Borel measurable function.
Proof. The proof is through showing that C is an instance of a zero-sum stochastic game that satisfies the conditions of the Borel model presented in [16]. From Lemma 1, we have that A 1 , A 2 and S are complete and separable metric spaces. By Lemma 5, the spaces Ξ i and Λ i are Borel sets of S ×A i and S ×P(A i ) for 1 ≤ i ≤ 2, respectively. By Proposition 1, δ is a Borel stochastic kernel. Furthermore, from Assumption 1 we have that r A + r S : (S × A) → R is bounded, and therefore it follows that C with respect to the zero-sum objective Y is an instance of a zero-sum stochastic game with Borel model and discounted payoffs introduced in [16]. Hence, (i) follows from [16, Theorems 2 and 3], and (ii) from the discounted case of [16, Theorem 1]. Finally, for (iii), since β ∈ (0, 1), we have that V ⋆ is bounded, and therefore V ⋆ is Borel measurable using [16, Lemma 3]. □ The following guarantees that value iteration (VI) converges to the value function.
Proposition 2 (Convergence sequence). For any V 0 ∈ F(S), the sequence (V t ) t∈N , where V t+1 = T V t , converges to V ⋆ . Moreover, each V t is bounded, Borel measurable.
Proof. Since r_A + r_S : (S × A) → R is bounded, using [16, Lemma 2] we have that, if V_t is bounded and Borel measurable, then T V_t is also bounded and Borel measurable. The result then follows from the fact that V⋆(s) = lim_{t→∞} V_t(s) for all s ∈ S if V_{t+1} = T V_t for all t ∈ N [16]. □

Value iteration
Despite the convergence result of Proposition 2, in practice there may not exist finite representations of general bounded Borel measurable functions (V t ) t∈N due to the uncountable state space. We now show how VI can be used to approximate the values of C , based on a sequence of B-PWC functions.

B-PWC closure and convergence
For NS-CSGs, we demonstrate that, under Assumption 1, a B-PWC representation of value functions is closed under the minimax operator and ensures the convergence of VI.
Theorem 2 (B-PWC closure and convergence). If V ∈ F(S) is B-PWC, then so are Q(·, α, V) for each α ∈ A and [T V]. If V_0 ∈ F(S) is B-PWC, then the sequence (V_t)_{t∈N} such that V_{t+1} = T V_t converges to V⋆, and each V_t is B-PWC.
Proof. Consider any B-PWC function V ∈ F(S) and joint action α ∈ A. Since r_A(·, α) + r_S(·) is B-PWC by Assumption 1, the fact that Q(·, α, V) is B-PWC follows if we can show that Q(·, α, V) is bounded, Borel measurable and PWC. Boundedness follows because V is bounded. The indicator function of a subset S′ ⊆ S is the function χ_{S′} : S → R such that χ_{S′}(s) = 1 if s ∈ S′ and 0 otherwise. Now χ_{S′} is Borel measurable if and only if S′ is a Borel set of S [42]. For clarity, we use q_α(s; ϕ, ϕ′) to refer to q_α from Lemma 4 for α ∈ A, s ∈ ϕ, ϕ ∈ Φ^α_P and ϕ′ ∈ Φ_P (where again Φ^α_P is from Lemma 4). For any s ∈ S such that δ(s, α) is defined, we have:

Q(s, α, V) = r_A(s, α) + r_S(s) + β Σ_{ϕ∈Φ^α_P} χ_ϕ(s) Σ_{ϕ′∈Φ_P : p_α(ϕ,ϕ′)>0} p_α(ϕ, ϕ′) V(q_α(s; ϕ, ϕ′)) .

Since ϕ is a Borel set of S, we have that χ_ϕ is Borel measurable. Next, we show that V(q_α(·; ϕ, ϕ′)) is Borel measurable on ϕ. Let Φ_V be a constant-BFCP of S for V. Given c ∈ R, we denote by Φ^{≥c}_V the set of regions in Φ_V on which V ≥ c holds. The pre-image of [c, ∞) under V(q_α(·; ϕ, ϕ′)) defined on ϕ is given by:

∪_{ϕ_V ∈ Φ^{≥c}_V} {s ∈ ϕ | q_α(s; ϕ, ϕ′) ∈ ϕ_V} .

Since q_α(s; ϕ, ϕ′) is Borel measurable in s ∈ ϕ (see Lemma 4) and ϕ_V is a Borel set of S, then {s ∈ ϕ | q_α(s; ϕ, ϕ′) ∈ ϕ_V} is a Borel set of ϕ. Since Φ^{≥c}_V is finite, this pre-image is a finite union of Borel sets and hence a Borel set, so V(q_α(·; ϕ, ϕ′)) is Borel measurable on ϕ, and therefore Q(·, α, V) is Borel measurable.
Next, since q α ( · ; ϕ, ϕ ′ ) is BFCP invertible on ϕ by Lemma 4, there exists a BFCP Φ q of ϕ such that all states in each region of Φ q are mapped into the same region of Φ V under q α ( · ; ϕ, ϕ ′ ). Following this, V (q α ( · ; ϕ, ϕ ′ )) is constant on each region of Φ q . Therefore, using the fact that χ ϕ is PWC, it follows that Q( · , α, V ) is PWC, which completes the proof that Q( · , α, V ) is B-PWC.
From Proposition 2 we have that [T V] is bounded, Borel measurable. Since Q(·, α, V) is PWC for any joint action α ∈ A, A(s) is PWC and A is finite, it follows that [T V] is PWC, using the fact that the value of the zero-sum normal-form game induced at every s ∈ S is unique. Thus, [T V] is B-PWC. The remainder of the proof follows directly from Banach's fixed point theorem and the fact that we have proved that, if V ∈ F(S) is B-PWC, so is [T V]. □

B-PWC VI
We use the closure property of B-PWC value functions under the minimax operator from Theorem 2 to iteratively construct a sequence (V_t)_{t∈N} of such functions to approximate V⋆ with a convergence guarantee. Algorithm 1 presents our B-PWC VI scheme, where the BFCP of the B-PWC value function at each iteration is refined (line 6) and subsequently the B-PWC value function is updated via minimax computations (line 8) for a state sampled from each of its regions.
Initialization. The function V_0 is initialised as a 0-valued B-PWC function defined over the BFCP Φ_P + Σ_{α∈A} Φ^α_R.

Algorithm 1 (B-PWC VI). Input: NS-CSG C, perception FCP Φ_P, reward FCPs (Φ^α_R)_{α∈A}, error ε. Output: approximate value function.

The algorithm. The steps of our B-PWC VI algorithm are illustrated in Fig. 3. These steps use Preimage BFCP(Φ_{V_t}, Φ_P, (Φ^α_R)_{α∈A}), see Algorithm 2, to compute a refinement of Φ_P + Σ_{α∈A} Φ^α_R that is a pre-image BFCP of Φ_{V_t} for δ. Then, in order to compute the value V_{t+1}(ϕ) over each region ϕ ∈ Φ, we take one state s ∈ ϕ and then find the value of a zero-sum normal-form game [47] at s induced by Definition 10.
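For concreteness, the sketch below shows one way to compute the value of the zero-sum normal-form game induced at a sampled state, by solving the maximiser's linear program (the prototype implementation described in Example 5 below solves the corresponding LP with SciPy); the payoff matrix used here is an illustrative placeholder for Q(s, ·, V_t).

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(Q):
    """Return (value, maximiser's mixed strategy) of the zero-sum game with payoffs Q."""
    n1, n2 = Q.shape
    # variables: (v, u_1(a_1) for each a_1); maximise v subject to
    # sum_a1 u_1(a_1) * Q[a_1, a_2] >= v for all a_2, and u_1 a distribution.
    c = np.zeros(n1 + 1); c[0] = -1.0                 # linprog minimises, so use -v
    A_ub = np.hstack([np.ones((n2, 1)), -Q.T])        # v - u_1^T Q[:, a_2] <= 0
    b_ub = np.zeros(n2)
    A_eq = np.hstack([[[0.0]], np.ones((1, n1))])     # probabilities sum to 1
    b_eq = np.array([1.0])
    bounds = [(None, None)] + [(0.0, None)] * n1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[0], res.x[1:]

Q = np.array([[3.0, -1.0], [0.0, 2.0]])               # placeholder payoffs at a state s
value, u1 = matrix_game_value(Q)
print(round(value, 3), np.round(u1, 3))               # 1.0, [0.333 0.667]
```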
As a convergence criterion for B-PWC VI in Algorithm 1, we detect when the difference between successive value approximations falls below a threshold ε (as usual for VI, this does not guarantee an ε-optimal solution). The function Dist(V t+1 , V t ) computes the difference between V t+1 and V t , which may have different regions due to the possible inconsistency between Φ V t+1 and Φ V t . An intuitive method is to evaluate V t+1 and V t at a finite set of points, and then compute the maximum difference. In the usual manner for VI, an approximately optimal strategy can be extracted from the final step of the computation.
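A minimal sketch of this sample-based convergence check is given below; the value functions are treated as plain callables and the sample states are arbitrary placeholders.

```python
def dist(V_next, V_prev, sample_states):
    """Maximum absolute difference between two value functions on sample states."""
    return max(abs(V_next(s) - V_prev(s)) for s in sample_states)

# e.g., with value functions given as callables and one sample state per region:
V_prev = lambda s: 0.0
V_next = lambda s: 0.5
print(dist(V_next, V_prev, sample_states=[0, 1, 2]) < 1e-6)   # False: keep iterating
```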
Algorithm 2 requires region-wise computations involving the image and pre-image of a region, region intersection and the sum of BFCPs. In particular, Intersect(ϕ, Φ ϕ pre ) is the refinement of ϕ obtained by computing all pairwise intersections of ϕ with regions in Φ ϕ pre and, by construction, is a pre-image BFCP of Φ for δ over ϕ. The following corollary then follows from Lemma 4 and Theorem 2.
Corollary 1 (BFCP iteration for B-PWC VI). In Algorithm 2, Φ_pre is a refinement of Φ_P + Σ_{α∈A} Φ^α_R and is a pre-image BFCP of Φ for δ.
Polytope regions. Our B-PWC VI algorithm assumes that each region in a BFCP is finitely representable. We now briefly discuss the use of BFCPs defined by polytopes, which suffice for perception BFCPs of ReLU NNs (discussed below). The focus is the region-based computations required by Algorithms 1 and 2. If polytopes ϕ_1 and ϕ_2 are represented by sets of halfspaces {(W_k, b_k)}_{k=1}^{ℓ_1} and {(W_k, b_k)}_{k=ℓ_1+1}^{ℓ}, respectively, then the intersection ϕ_1 ∩ ϕ_2 is the intersection of ℓ halfspaces and can be represented as {(W_k, b_k)}_{k=1}^{ℓ}. Therefore, the sum Φ_1 + Φ_2 of two BFCPs Φ_1 and Φ_2 can be computed by considering the intersection ϕ_1 ∩ ϕ_2 of all pairwise combinations of regions ϕ_1 ∈ Φ_1 and ϕ_2 ∈ Φ_2.
The image of a polytope ϕ = {x ∈ R^m | g_k(x) ≥ 0 for 1 ≤ k ≤ ℓ} under an invertible linear function f is again a polytope, obtained by substituting f^{-1} into the constraints, and pre-images are computed analogously by substituting f. Checking the feasibility of a set constrained by a set of linear inequalities can be solved by a linear program solver [48].
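For example, non-emptiness of a polytope given in halfspace form can be checked with a feasibility LP, as in the following sketch using SciPy's linprog; the constraint matrices are illustrative placeholders.

```python
import numpy as np
from scipy.optimize import linprog

def polytope_nonempty(A, b):
    """Feasibility LP with zero objective: succeeds iff {x | A x <= b} is non-empty."""
    res = linprog(c=np.zeros(A.shape[1]), A_ub=A, b_ub=b,
                  bounds=[(None, None)] * A.shape[1])
    return res.status == 0      # status 0 means a feasible (optimal) point was found

# e.g., the intersection of 0 <= x <= 1 with x >= 2 is empty:
A = np.array([[1.0], [-1.0], [-1.0]]); b = np.array([1.0, 0.0, -2.0])
print(polytope_nonempty(A, b))   # False
```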
ReLU networks. If each perception function obs i is implemented via a ReLU NN working as a classifier, where the activation function is B-PWL, then the pre-images of the ReLU NN for each percept [46] have linear boundaries, and therefore all regions in the corresponding perception BFCP Φ P can be represented by polytopes (see Example 4). If there exist polytope constant-BFCPs for B-PWC r A ( · , α) and r S for all α ∈ A, then all regions in Φ α R for α ∈ A are polytopes. If δ E ( · , α) is piecewise linear and invertible and ϕ ′ is a polytope (line 5 in Algorithm 2), then {s ∈ ϕ | Θ α s ∩ ϕ ′ ̸ = ∅} is a polytope. Therefore, each region in Φ pre is a polytope after every iteration and the operations over polytopes, including intersections, image and preimage computations, directly follow from the computation above.
Example 5. We now return to the NS-CSG model, presented in Example 1 of a dynamic vehicle parking problem with the perception functions implemented via the linear regression model given in Example 3. To demonstrate implementability of our approach we synthesise strategies using a prototype Python implementation of the B-PWC VI algorithm.
The implementation uses a polyhedral representation of regions, and the values of the zero-sum normal-form games involved in the minimax operator at step 8 of Algorithm 1 are found by solving the corresponding linear program [47] using the SciPy library [48]. We have partitioned the state space of the game into two sets corresponding to the two possible local states of Ag_1. The B-PWC VI algorithm converges after 46 iterations when ε = 10^{-6} and takes 3,825s to complete. For each set in the partition of the state space, the BFCP of this set converges to the product of two 8 × 8 grids. For the current chosen parking spot of Ag_1 (red square) and coordinate of Ag_2 (purple triangle), the value function with respect to the coordinate of Ag_1 is presented in Fig. 4 (left) and shows that the closer Ag_1 is to its chosen parking spot, the higher the (approximate) optimal value. The lightest-colour class is caused by an immediate crash, and its position follows from the observation function.
An (approximate) optimal strategy for Ag 1 is presented in Fig. 4 (right), where the colour of an arrow is proportional to the probability of moving in that direction and the rotating arrow represents the parking action. There are several choices which are not intuitive. For example, although a crash cannot be avoided before reaching its current parking spot, Ag 1 moves left when in

Policy iteration
It is known that, for MDPs, PI algorithms generally converge faster than VI algorithms, since a single policy improvement step can jump directly over many intermediate policies [49]. Motivated by this fact, in this section we show how PI can be used to approximate the values and optimal strategies of an NS-CSG C with respect to a discounted accumulated reward objective Y. Our algorithm takes ideas from recent work [17], which proposed a new PI method to solve zero-sum stochastic games with finite state spaces, and is the first PI algorithm for CSGs with Borel state spaces that comes with a convergence guarantee. Our PI algorithm ensures that the strategies and value functions generated during each iteration never leave a finitely representable class of functions. In addition, when computing values of CSGs, efficiencies are gained over alternative algorithms as there is no need to solve normal-form games, which is required by our B-PWC VI and the Pollatschek and Avi-Itzhak PI algorithm [25], nor to solve MDPs, which adds complexity to the Hoffman-Karp PI algorithm [24]. This results in cheaper computations and faster convergence over these alternatives, as for PI over VI for MDPs.

Operators, functions and solutions
Before presenting the algorithm, the following operators, functions and solutions are proposed. Let γ ∈ R be a constant such that γ > 1 and γβ < 1, which will be used to distribute the discount factor β between policy evaluation and policy improvement of the two agents.
Operators for Max-Min and Min-Max. Before introducing operators for Max-Min and Min-Max, we require the notion of a stationary Stackelberg (follower) strategy for Ag 2 , which is a stochastic kernel σ 2 : Λ 1 → P(A 2 ), i.e., σ 2 ∈ P(A 2 | Λ 1 ) such that σ 2 (A 2 (s) | (s, u 1 )) = 1 for (s, u 1 ) ∈ Λ 1 . This strategy is introduced only for the PI algorithm and implies that Ag 2 makes decisions conditioned on the current state s and the current choice of Ag 1 , i.e. action distribution u 1 , and thus allows us to split the maximum and minimum operations of the two agents. We denote by Σ 2 the set of all stationary Stackelberg strategies for Ag 2 .
Unlike the classical PI algorithms by Hoffman and Karp [24] and Pollatschek and Avi-Itzhak [25], following [17], our PI algorithm separates the policy evaluation and policy improvement of the maximiser (Ag 1 ) and the minimiser (Ag 2 ) through the use of the operators of Definition 11 and Definition 12, respectively. To track the value functions after performing policy evaluation of Ag 1 and Ag 2 , our PI algorithm introduces value functions J 1 and J 2 . In addition, the value functions V 1 and V 2 are introduced to avoid the oscillatory behavior of the Pollatschek and Avi-Itzhak PI algorithm [25], thus ensuring convergence, and are updated only during policy improvement. The role of γ is to split the discount factor β such that all the operators corresponding to policy evaluation and policy improvement of the two agents are contraction mappings, which then ensures convergence.
Two function representations. We next define two classes of functions, which play a key role in characterizing the functions and strategies generated during each iteration of our PI algorithm.
Definition 14 (CON-PWC stochastic kernel). A function f ∈ Σ_2 is a constant-piecewise-constant (CON-PWC) stochastic kernel if there exists a BFCP Φ of S such that, for each ϕ ∈ Φ, A(s) = A(s′) for s, s′ ∈ ϕ, and Φ generates Θ = {θ(ϕ) | ϕ ∈ Φ}, where θ(ϕ) = {(s, u_1) ∈ Λ_1 | s ∈ ϕ}, a BFCP of Λ_1, such that for θ(ϕ) ∈ Θ: (i) f(·, u_1) : ϕ → P(A_2(s)) is constant for u_1 ∈ P(A_1(s)) where s ∈ ϕ; (ii) f(s, ·) : P(A_1(s)) → P(A_2(s)) is B-PWC for s ∈ ϕ.

Fig. 5 presents an example of a CON-PWL Borel measurable function and a CON-PWC stochastic kernel over a region. We now show that these two classes of functions can be represented by finite sets of vectors. Each CON-PWL Borel measurable function f can be represented by a finite set of vectors, where Φ is a BFCP of S for f using Definition 13, Φ′(ϕ) is a BFCP of {u_1 ∈ P(A_1) | (s, u_1) ∈ θ(ϕ)}, and θ(ϕ) ∈ Θ, again using Definition 13, is such that, over each region ϕ′ ∈ Φ′(ϕ), f(s, u_1) is linear in u_1 given s ∈ ϕ. Similarly, using Definition 14, each CON-PWC stochastic kernel f can be represented by a finite set of vectors.

Maximum or minimum solutions. We introduce a criterion for selecting the maximum or minimum solution over a region, by which the strategies from policy improvement are finitely representable.

Minimax-action-free PI
We now use the operators of Definitions 11 and 12, together with the functions and solutions from Definitions 13, 14, 15 and 16, to derive a PI algorithm called Minimax-action-free PI (Algorithm 3) for strategy synthesis for NS-CSGs with Borel state spaces.

Algorithm 3 (iteration t of Minimax-action-free PI) performs one of the following four iterations: policy evaluation of Ag_1 (procedure PE1); policy improvement of Ag_1 by a CON-1 solution (procedure PI1); policy evaluation of Ag_2 (procedure PE2); or policy improvement of Ag_2 by a CON-2 solution (procedure PI2).

Our algorithm closely follows the PI⋆ method of [17] for finite state spaces, but has to resolve a number of issues due to the uncountability of the underlying state space and the need to ensure Borel measurability at each iteration. To overcome these issues we (i) introduce CON-PWL Borel measurable functions and CON-PWC Borel measurable strategies to ensure measurability and finite representability; (ii) work with CON-1 and CON-2 solutions for policy improvement to ensure that the strategies generated are finitely representable and consistent; and (iii) propose a BFCP iteration algorithm (Algorithm 4) and a BFCP-based computation algorithm (Algorithm 5) to compute a new BFCP of the state space and the values or strategies over this BFCP. We also provide a simpler proof than that presented in [17], which does not require the introduction of any new concepts except those used in the algorithm.

Initialization. The Minimax-action-free PI algorithm is initialized with strategies σ^0_1 and σ^0_2 for each player, which are uniform distributions over available actions/state-action pairs, i.e., σ^0_1(s) is the uniform distribution over A_1(s) for all s ∈ S and σ^0_2(s, u_1) is the uniform distribution over A_2(s) for all (s, u_1) ∈ Λ_1, and four 0-valued functions J^0_1, V^0_1, J^0_2, V^0_2, i.e., J^0_1(s) = V^0_1(s) = 0 for all s ∈ S and J^0_2(s, u_1) = V^0_2(s, u_1) = 0 for all (s, u_1) ∈ Λ_1, and Algorithm 4 gives one BFCP for each strategy and function.

The algorithm. An iteration of the Minimax-action-free PI algorithm is given in Algorithm 3. As shown later, the order and frequency with which the four possible iterations of Algorithm 3 are run do not affect convergence, as long as each is performed infinitely often. This permits an asynchronous implementation of the Minimax-action-free PI algorithm, as discussed in [17] and for its single-agent counterparts in [50].
For each of the four iterations, Algorithm 4 provides a way to compute new BFCPs and the results below demonstrate that, over each region of these BFCPs, the corresponding computed strategies and value functions are either constant, PWC or PWL. Therefore, we can follow similar steps to our VI algorithm (see Algorithm 1) to compute the value functions of these new strategies and value functions (see Algorithm 5). The idea is to first compute the BFCPs Φ J t+1 and Θ σ t+1 2 via Algorithm 4 and then use them to compute strategies and value functions using Algorithm 5. For instance, if policy improvement of Ag 2 is chosen at iteration t ∈ N then we proceed as follows. First, new BFCPs are computed via Algorithm 4. Second, procedure PI2 of Algorithm 5 is performed. In this step we take each region θ ∈ Θ σ t+1 2 , let ϕ = {s | (s, u 1 ) ∈ θ}, then take one state s ′ ∈ ϕ, and compute a BFCP Φ u of P(A 1 (s ′ )) such that min u 2 ∈P(A 2 (s ′ )) [H 2 is constant over ϕ u ∈ Φ u and for u 1 ∈ ϕ u . Third, take one u ′ 1 ∈ ϕ u and find u ′ 2 ∈ P(A 2 (s ′ )) that minimises [H 2 by Lemma 9 and V t+1 2 (s, u 1 ) is CON-linear in s ∈ ϕ and u 1 ∈ ϕ u . Finally, we copy the other strategies and value functions for the next iteration.
Representation closures. The following lemmas show the strategies and value functions generated during each iteration of the Minimax-action-free PI algorithm are closed under B-PWC, CON-PWL and CON-PWC functions, and are thus finitely representable.
Lemma 6 (Evaluation closure for Ag_1). If σ^t_1 ∈ Σ_1 is a PWC stochastic kernel, J^t_2, V^t_2 ∈ F(Λ_1) are CON-PWL Borel measurable and policy evaluation of Ag_1 is performed (procedure PE1), then J^{t+1}_1 ∈ F(S) is B-PWC.

Proof. Suppose σ^t_1 ∈ Σ_1 is a PWC stochastic kernel and J^t_2, V^t_2 ∈ F(Λ_1) are CON-PWL Borel measurable. Since σ^t_1 is a PWC stochastic kernel, there exists a constant-BFCP Φ_{σ^t_1} of S for σ^t_1. Since J^t_2 is a CON-PWL Borel measurable function, there exists a BFCP Φ_{J^t_2} of S satisfying the properties of Definition 13 for J^t_2. Therefore J^t_2(s, σ^t_1(s)) is constant on each region of the BFCP Φ_{σ^t_1} + Φ_{J^t_2}. We can similarly show that V^t_2(s, σ^t_1(s)) is constant on each region of the BFCP Φ_{σ^t_1} + Φ_{V^t_2}. Consider the policy evaluation of Ag_1 (procedure PE1). Using Definition 11, J^{t+1}_1 is constant on each region of the BFCP Φ_{σ^t_1} + Φ_{J^t_2} + Φ_{V^t_2}, hence PWC and Borel measurable, and J^{t+1}_1 is also bounded as required. □

Lemma 7 (Improvement closure for Ag_1). If J^t_2, V^t_2 ∈ F(Λ_1) are CON-PWL Borel measurable and policy improvement of Ag_1 is performed (procedure PI1), then σ^{t+1}_1 ∈ Σ_1 is a PWC stochastic kernel and V^{t+1}_1 ∈ F(S) is B-PWC.

Proof. Suppose J^t_2, V^t_2 ∈ F(Λ_1) are CON-PWL Borel measurable functions. Using [42, Chapter 18.1] and Definition 13, it follows that the function K^t := min[J^t_2, V^t_2] is Borel measurable. Note that, over each region of Φ_{J^t_2} + Φ_{V^t_2}, K^t(s, u_1) is constant in s given u_1, and PWL in u_1 given s (where Φ_{J^t_2} and Φ_{V^t_2} are from Lemma 6), and therefore K^t is CON-PWL. Let Φ_{K^t} = Φ_{J^t_2} + Φ_{V^t_2} and Θ_{K^t} be a BFCP of Λ_1 satisfying the properties of Definition 13 for K^t. Every state in each region of the BFCP Φ_{K^t} has the same set of available actions for Ag_1 and the same strategy u_1 that maximises K^t(s, u_1) on a region of Θ_{K^t}. Therefore, using the CON-1 solution in Definition 15, the strategy of Ag_1 that maximises K^t(s, u_1) is constant on each region of Φ_{K^t}, so σ^{t+1}_1 is a PWC stochastic kernel, and the corresponding maximum values yield a B-PWC function V^{t+1}_1 as required. □

Lemma 8 (Evaluation closure for Ag_2). If J^t_1, V^t_1 ∈ F(S) are B-PWC, σ^t_2 ∈ Σ_2 is a CON-PWC stochastic kernel and policy evaluation of Ag_2 is performed (procedure PE2), then J^{t+1}_2 ∈ F(Λ_1) is CON-PWL Borel measurable.

Proof. Suppose J^t_1 and V^t_1 are B-PWC and σ^t_2 ∈ Σ_2 is a CON-PWC stochastic kernel. Using [42, Chapter 18.1] it follows that γ max[J^t_1, V^t_1] is B-PWC. In view of the B-PWC function Q(·, α, V) in Theorem 2, for each α ∈ A the function Q̂^t_α(s) := Q(s, α, γ max[J^t_1, V^t_1]) is B-PWC. Let Φ_{Q̂^t} be a BFCP of S such that Q̂^t_α is constant on each region of Φ_{Q̂^t} for α ∈ A. It follows that A(s) is constant on each region of Φ_{Q̂^t}.
Next, let Φ_{σ^t_2} be a BFCP of S satisfying the properties of Definition 14 for the CON-PWC stochastic kernel σ^t_2. For the BFCP Φ_{Q̂^t} + Φ_{σ^t_2} of S, we generate a BFCP Θ^t_1 of Λ_1 such that each region θ^t_1(ϕ) ∈ Θ^t_1, induced by a region ϕ ∈ Φ_{Q̂^t} + Φ_{σ^t_2}, is given by θ^t_1(ϕ) = {(s, u_1) ∈ Λ_1 | s ∈ ϕ}. Finally, consider the policy evaluation of Ag_2. According to Definition 12, for (s, u_1) ∈ θ^t_1(ϕ), the function J^{t+1}_2(s, u_1) is constant in s for a fixed u_1, and PWL in u_1 for a fixed s ∈ S. Thus, J^{t+1}_2 is CON-PWL. Since Q̂^t_α and σ^t_2 are bounded, Borel measurable, then so is J^{t+1}_2 by Definition 12 as required. □

Lemma 9 (Improvement closure for Ag_2). If J^t_1, V^t_1 ∈ F(S) are B-PWC and policy improvement of Ag_2 is performed (procedure PI2), then σ^{t+1}_2 ∈ Σ_2 is a CON-PWC stochastic kernel and V^{t+1}_2 ∈ F(Λ_1) is CON-PWL Borel measurable.

Proof. Suppose J^t_1, V^t_1 ∈ F(S) are B-PWC. For the BFCP Φ_{Q̂^t} of S, we generate a BFCP Θ^t_2 of Λ_1 such that each region θ^t_2(ϕ) in Θ^t_2, induced by a region ϕ ∈ Φ_{Q̂^t}, is given by θ^t_2(ϕ) = {(s, u_1) ∈ Λ_1 | s ∈ ϕ}, where Φ_{Q̂^t} is from the proof of Lemma 8. Consider the policy improvement of Ag_2 (procedure PI2). According to Definition 12, by using the CON-2 solution in Definition 16, for (s, u_1) ∈ θ^t_2(ϕ), the Stackelberg strategy of Ag_2 obtained as a minimiser over u_2 ∈ P(A_2(s)) is constant in s for a fixed u_1, and PWC in u_1 for a fixed s. Thus, σ^{t+1}_2 is CON-PWC. Since σ^{t+1}_2 is a CON-PWC stochastic kernel, Lemma 8 implies that V^{t+1}_2 is CON-PWL Borel measurable as required. □

By fusing Lemmas 6, 7, 8 and 9 we can prove that the strategies and value functions generated during each iteration of Algorithm 3 never leave a finitely representable class of functions, and Algorithm 4 constructs new BFCPs such that the strategies and value functions after one iteration of the Minimax-action-free PI algorithm remain constant, PWC, or PWL on each region of the constructed BFCPs.
Theorem 3 (Representation closure). In any iteration of the Minimax-action-free PI algorithm (see Algorithm 3), if σ^t_1 ∈ Σ_1 is a PWC stochastic kernel, J^t_1, V^t_1 ∈ F(S) are B-PWC, J^t_2, V^t_2 ∈ F(Λ_1) are CON-PWL Borel measurable and σ^t_2 ∈ Σ_2 is a CON-PWC stochastic kernel, then the same properties hold for σ^{t+1}_1, J^{t+1}_1, V^{t+1}_1, J^{t+1}_2, V^{t+1}_2 and σ^{t+1}_2, respectively, regardless of which one of the four iterations is performed.

Convergence analysis and strategy computation
We next prove the convergence of the Minimax-action-free PI algorithm by showing that there exists an operator from the product space of the function spaces over which J 1 , V 1 , J 2 and V 2 are defined to itself, which is a contraction mapping with a unique fixed point, one of whose components is the value function multiplied by a known constant. The proof closely follows the steps for finite state spaces given in [17], but is more complex due to the underlying infinite state space and the need to deal with the requirement of Borel measurability and finite representation of strategies and value functions.
Lemma 11 (PWC strategies). If V = γV t 1 , where V t 1 is from iteration t ∈ N of the Minimax-action-free PI algorithm, and (σ 1 , σ 2 ) ∈ Σ achieves the maximum and the minimum in Definition 10 for V and all s ∈ S via a CON-3 solution, then σ 1 and σ 2 are PWC stochastic kernels.
Proof. By Theorems 3 and 4, V is B-PWC. For any α ∈ A, the function Q(·, α, V) : S → R is B-PWC by Theorem 2. Let Φ_Q be a BFCP of S such that Q(·, α, V) is constant on each region of Φ_Q for α ∈ A, and Φ_A be a BFCP of S such that A(s) is constant on each region of Φ_A. Then, for u_1 ∈ P(A_1(s)) and u_2 ∈ P(A_2(s)), the function Q′(·, u_1, u_2) : S → R, where Q′(s, u_1, u_2) = Σ_{(a_1,a_2)∈A(s)} Q(s, (a_1, a_2), V) u_1(a_1) u_2(a_2) for s ∈ S, is constant in each region of Φ_Q + Φ_A. Therefore, there exists a CON-3 solution (σ_1, σ_2) of Q′(s, u_1, u_2) and, since Φ_Q + Φ_A is a BFCP, the result follows. □

Conclusions
We have proposed a novel modelling formalism called neuro-symbolic concurrent stochastic games (NS-CSGs) for representing probabilistic finite-state agents with NN perception mechanisms interacting in a shared, continuous-state environment. NS-CSGs have the advantage of allowing for the perception of a complex environment to be synthesised from data and implemented via NNs, while the safety-critical decision-making module is symbolic, explainable and knowledge-based.
For zero-sum discounted cumulative reward problems, we proved the existence and measurability of the value function of NS-CSGs under Borel measurability and piecewise constant restrictions. We then presented the first implementable B-PWC VI and Minimax-action-free PI algorithms with finite representations for computing the values and optimal strategies of NS-CSGs, assuming a fully observable setting, by proposing B-PWC, CON-PWL and CON-PWC functions. The B-PWC VI algorithm is, at the region level, the same as VI for finite state spaces, but involves, at each iteration, a division of the uncountable state space into a finite set of regions (i.e., a BFCP). The Minimax-action-free PI algorithm requires multiple divisions of the uncountable state space into BFCPs at each iteration; following [17], it ensures convergence and, by not requiring the solution of normal-form games or MDPs at each iteration, reduces computational complexity. However, implementation of the Minimax-action-free PI algorithm is more challenging, requiring a distributed, asynchronous framework.
We illustrated our approach by modelling a dynamic vehicle parking problem as an NS-CSG and synthesising approximately optimal values and strategies using B-PWC VI. Future work will involve improving efficiency and generalising to other observation functions by working with abstractions, extending our methods to allow for partial observability, and moving to equilibria-based (nonzero-sum) properties, where initial progress has been made by building on our NS-CSG model [38].