Online shielding for reinforcement learning

Despite the recent impressive results on reinforcement learning (RL), safety is still one of the major research challenges in RL. RL is a machine-learning approach to determine near-optimal policies in Markov decision processes (MDPs). In this paper, we consider the setting where the safety-relevant fragment of the MDP together with a temporal logic safety specification is given, and many safety violations can be avoided by planning ahead a short time into the future. We propose an approach for online safety shielding of RL agents. During runtime, the shield analyses the safety of each available action. For any action, the shield computes the maximal probability of not violating the safety specification within the next k steps when executing this action. Based on this probability and a given threshold, the shield decides whether to block an action from the agent. Existing offline shielding approaches exhaustively compute the safety of all state-action combinations ahead of time, resulting in huge computation times and large memory consumption. The intuition behind online shielding is to compute at runtime the set of all states that could be reached in the near future. For each of these states, the safety of all available actions is analysed and used for shielding as soon as one of the considered states is reached. Our approach is well-suited for high-level planning problems where the time between decisions can be used for safety computations and it is acceptable for the agent to wait until these computations are finished. For our evaluation, we selected a 2-player version of the classical computer game Snake. The game represents a high-level planning problem that requires fast decisions, and the multiplayer setting induces a large state space, which is computationally expensive to analyse exhaustively.


Introduction
Reinforcement Learning (RL) has proven successful in solving complex tasks that are difficult to solve using classic controller design, including applications in computer games [35], multi-agent planning [40], and robotics [39]. RL learns high-performance controllers by optimising objectives expressed via rewards in unknown, stochastic environments. Although learning-enabled controllers (LECs) have the potential to outperform classical controllers, safety concerns prevent LECs from being widely used in real-world tasks [2].
Shielding [5] is a runtime enforcement technique to ensure safe decision making. When an RL agent is augmented with a shield, the shield blocks unsafe actions at every time step, and the learning agent can only pick a safe action to be sent to the environment. Shields are automatically constructed via correct-by-construction formal synthesis methods from a model of the safety-relevant environment dynamics and a safety specification. In the deterministic setting, shields ensure that unsafe states specified by the safety specification are never visited. Consequently, in the absence of uncertainties, an agent augmented with a shield is guaranteed to satisfy the safety objective as long as the shield is used.
In scenarios that incorporate uncertainties, probabilistic shields have been used to reduce safety violations during training and execution [20]. The premise of this work is that often in real-world applications, many safety violations can be avoided by analysing the consequences of actions in the near future. To compute probabilistic shields, the safety-relevant dynamics of the environment are modelled as a Markov Decision Process (MDP) and the specification ϕ is expressed in a safety fragment of linear temporal logic [4]. Such a specification could, for example, forbid reaching a set of critical states in the MDP.
For each state and action, the exact probability that executing the action from the current state leads to a violation of ϕ within the next k steps is computed. At a state s, an action a is called unsafe if executing a incurs a probability of violating ϕ within the next k steps greater than a threshold δ with respect to the optimal safety probability possible in s. During runtime, the probabilistic shield then blocks any unsafe action from the agent. A probabilistic shield can be used to shield the decisions of an agent in the training as well as in the execution phase. In this paper, we build on the approach of [20]. Hence, from here on, we simply refer to probabilistic shields as shields.
The problem with offline shielding. The computation of an offline shield for discrete-event systems requires an exhaustive, ahead-of-time safety analysis for all possible state-action combinations. Therefore, the complexity of offline shield synthesis grows exponentially in the state and action dimensions, which limits the application of offline shielding to small environments. Previous work that applied shields in complex, high-dimensional environments relied on over-approximations of the reachable states and domain-oriented abstractions [1,3]. However, this may result in imprecise safety computations of the shield. This way, the shield may become over-restrictive, hindering the learning agent in properly exploring the environment and finding its optimal policy [20].
Our solution: online shielding. Our approach is based on the idea of computing the safety of actions on the fly at runtime. In many applications, the learning agent does not have to take a decision at every time step. Instead, the learning agent only has to make a decision when reaching a decision state. As an example, consider a service robot traversing a corridor. The agent has time until the service robot reaches the end of the corridor, i.e., the next decision state, to decide where the service robot should go next. Online shielding uses the time between two decision states to compute the safety of all possible actions in the next decision state. When reaching the next decision state, this information is used to block unsafe actions of the agent. While the online safety analysis incurs a runtime overhead, every single computation of the safety of an action is efficient and parallelisable. Thus, in many settings, expensive offline pre-computation and huge shielding databases with costly lookups are not necessary. Since the safety analysis is performed only for decision states that are actually reached, online shielding is applicable to large, changing, or unknown environments.
We address the problem of shielding a controllable RL agent in an environment shared with other autonomous agents that perform tasks concurrently. For example, some combinations of agent positions may be unsafe, as they correspond to collisions. The specification then forbids visiting such states, i.e., $\phi = \mathbf{G}(\neg S_{\mathit{collision}})$.
In online shielding, the computation of the safety for any action in the next decision state starts as soon as the controllable agent leaves the current decision state. The tricky part of online shielding in the multi-agent setting is that during the time the RL agent has between two consecutive decisions, the other agents also change their positions. Therefore, online shielding requires computing the safety of actions with respect to all possible movements of the other agents. As soon as the next decision state is reached, the results from the safety analysis are used to block unsafe actions.
Technically, we use MDPs to formalise the dynamics of the agents operating within the environment. At runtime, we create a sub-MDP for each action. These sub-MDPs model the immediate future from the viewpoint of the RL agent. Via model checking, we determine for the next k steps the minimal probability of violating ϕ. An action is unsafe, and therefore blocked by the shield, if it violates safety with a probability higher than a threshold δ. We generally set δ relative to the minimal probability of violating ϕ within the next k steps. In some states, any available action may impose a large risk of violating ϕ, while in other states there may exist actions that guarantee staying safe within the next k steps. Using a relative threshold allows an adaptive notion of shielding and ensures deadlock freedom by always allowing at least one action.
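The relative threshold can be illustrated with a minimal sketch. It assumes an additive reading, in which an action is blocked when its violation probability exceeds the smallest violation probability among the available actions by more than δ; the function name `allowed_actions` and the dictionary input are hypothetical and only serve the illustration.

```python
from typing import Dict, Hashable, List


def allowed_actions(violation_prob: Dict[Hashable, float], delta: float) -> List[Hashable]:
    """Return the actions that the shield lets through.

    violation_prob maps each available action to its already computed
    probability of violating the specification within the next k steps.
    Assumption: "relative" means an additive margin delta on top of the
    smallest violation probability among the available actions.
    """
    if not violation_prob:
        raise ValueError("a decision state must offer at least one action")
    best = min(violation_prob.values())
    # The minimiser always satisfies p <= best + delta, so the result is never empty.
    return [a for a, p in violation_prob.items() if p <= best + delta]


risky_state = {"left": 0.90, "straight": 0.93, "right": 0.99}
safe_state = {"left": 0.00, "straight": 0.30, "right": 0.02}
print(allowed_actions(risky_state, delta=0.05))  # ['left', 'straight']
print(allowed_actions(safe_state, delta=0.05))   # ['left', 'right']
```

In the risky state every action is dangerous, yet the least dangerous ones remain available, which is the deadlock-freedom argument made above.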
Requirements and Limitations. The online shielding approach relies upon some requirements and is affected by a few limitations that we briefly summarise. The first limitation is that the proposed approach does not provide any worst-case bounds on computation time. Therefore, the possibility that the agent reaches the next decision state before the shield has decided which actions to block cannot be ruled out. Online shielding is therefore only applicable in settings that allow an alternative course of action, such as "waiting", if the safety analysis is not completed. In general, we recommend using online shielding in settings where the average time between visiting decision states is larger than the average time used for the safety analysis.
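The fallback requirement can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the paper: the analysis runs in a background worker via Python's `concurrent.futures`, and `WAIT`, `analyse`, and `choose_on_arrival` are hypothetical names.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List

WAIT = "wait"  # hypothetical fallback action; the setting must provide one


def choose_on_arrival(pending: Future, agent_choice: Callable[[List[str]], str]) -> str:
    """Pick an action when the agent arrives at a decision state.

    If the safety analysis started earlier has finished, the agent chooses
    among the allowed actions; otherwise the fallback action is returned.
    """
    if not pending.done():
        return WAIT
    return agent_choice(pending.result())


def analyse(next_state: str) -> List[str]:
    # Placeholder for the model-checking step; returns the allowed actions.
    return ["left", "straight"]


with ThreadPoolExecutor(max_workers=1) as pool:
    # Started as soon as the agent leaves the current decision state.
    future = pool.submit(analyse, "next-decision-state")
    future.result()  # here the analysis simply finishes before the agent "arrives"
    print(choose_on_arrival(future, agent_choice=lambda allowed: allowed[0]))

unfinished: Future = Future()  # stands in for an analysis that is still running
print(choose_on_arrival(unfinished, agent_choice=lambda allowed: allowed[0]))
```

The first call returns an allowed action, the second returns the fallback because its analysis never completed.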
As a second limitation, we list the possible state-space explosion of the individual sub-MDPs. The size of the sub-MDPs depends, among other factors, on the finite horizon k, the number of agents operating within the environment, and the number of decision states within the next k steps. In extreme cases where every state is a decision state, online shielding would likely not be applicable, because (1) there would be almost no time between the individual decisions and (2) the state space of the constructed sub-MDPs would explode.
Third, the safety of actions is only analysed within a finite horizon k. Therefore, the agent might end up in situations where any available action induces a high probability of violating the specification. It is therefore important to pick a finite horizon k large enough to prevent such situations. The minimal size of k needed to prevent many safety violations depends on the concrete setting, and there is a natural trade-off between the computational overhead for the safety analysis and the number of safety violations that can be prevented by the shield. As a reference value, we recommend using a finite horizon k larger than the number of steps between any two adjacent decision states.
Contributions. This paper is an extended version of [23], in which we gave the formalisation of online shielding in probabilistic environments and presented experimental results of shielding a simple tabular Q-learning agent for a 2-player version of the classical computer game Snake. The evaluation demonstrated that shields can be efficiently computed at runtime, guarantee safety, and have the potential to positively influence learning performance.
The novelty of this work with respect to [23] is an extensive case study that focuses on the question: Can we use online shielding to safely learn a safe policy?
An unshielded agent learns about the safety of actions by exploring actions and receiving negative rewards if an action is unsafe. In standard shielded learning, all unsafe actions are blocked from the agent and the agent never gets the chance to learn the safety constraints from the negative rewards. Since an unsafe action may look like a safe alternative to the RL agent due to the absence of negative rewards, unsafe actions might very likely be part of the final policy. To learn a safe policy under shielding, we propose to apply informed shields. These shields update the value function of the agent with negative rewards for any action that is blocked by the shield, using different thresholds for the comparison of the safety values. By investigating unshielded learning and learning with informed and uninformed shields in our case study, we empirically assess how shields affect convergence to an optimal and safe policy. Since learning a safe policy is only possible if the reward structure captures all relevant safety constraints [18], uninformed shields that merely block actions may hinder learning a safe policy.
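The idea of an informed shield can be sketched for a tabular value function. The sketch assumes that a blocked action is treated like a terminal transition with a fixed negative reward, so the update target has no bootstrapped term; the penalty value, the learning rate, and the name `penalise_blocked` are hypothetical choices for illustration.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Iterable, Tuple

QTable = DefaultDict[Tuple[Hashable, Hashable], float]


def penalise_blocked(q: QTable, state: Hashable, blocked: Iterable[Hashable],
                     penalty: float = -1.0, alpha: float = 0.5) -> None:
    """Move Q(state, a) towards `penalty` for every blocked action a.

    Assumption: blocking is treated like a terminal transition with reward
    `penalty`, so no value of a successor state enters the update.
    """
    for action in blocked:
        q[(state, action)] += alpha * (penalty - q[(state, action)])


q: QTable = defaultdict(float)
penalise_blocked(q, state="s0", blocked=["right"])
print(q[("s0", "right")])  # -0.5: moved halfway from 0.0 towards -1.0
```

The agent never executes the blocked action, yet its value estimate still reflects that the action is undesirable.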
We performed the case study on a deep Q-learning agent for 2-player Snake to study the safety of the final policies in various settings. We compared the final policies of unshielded agents and agents augmented with informed and uninformed shields. The results show that shielding leads to a better performance during learning in all cases. However, in most experiments, the final policies obtained by shielded learning produce more unsafe behaviour than the policies learned by unshielded agents. This happens with both uninformed and informed shields that provide negative rewards for blocked actions. These results suggest that in shielded learning, the shield is also needed in the execution phase.
Outline. The rest of the paper is structured as follows. Section 1.1 discusses related work. We discuss the relevant foundations in Section 2. In Section 3, we present the setting and formulate the problem that we address. We present online shielding in Section 4, by defining semantics for autonomous agents in the considered setting and defining online shield computations based on these semantics. In Section 5, we report on the evaluation of online shielding for the classic computer game Snake. Section 6 concludes the paper with a summary and an outlook on future work.

Related Work
In reinforcement learning (RL) [36], an agent aims to compute an optimal policy that maximizes the expected total amount of reward through trial and error via interactions with an unknown environment. While exploring unknown state-action pairs, RL agents that are agnostic to safety may inadvertently execute unsafe actions. Safe RL algorithms aim to guarantee safety even during exploration, at least with high probability.
Several recent works [15,16] considered logically-constrained RL, which employs temporal logic as a formal reward-shaping technique. The final policy of an agent trained with such a reward structure maximises the probability of satisfying the specified formula. In the context of safe RL, the formula can express a safety property and the trained agent will minimise the risk of violating the property. In order to ensure safety during training, logically-constrained RL has to be extended by restricting the exploration during training [17].
We follow a model-based approach for safe RL. In the continuous domain, model-based approaches have utilised Lyapunov-based methods [8] and control-barrier functions [7,28] to enable safe learning under continuous system dynamics. Li et al. [26] proposed model predictive shielding (MPS) for continuous systems. Given an optimal policy and a safe policy, MPS checks online before executing the next action of the optimal policy whether the new state would allow reaching an invariant safe state within the next k steps when executing the safe policy. If not, MPS does not execute the action of the optimal policy and switches to the safe policy instead. In the control community, this architecture is also known as the simplex architecture and gave rise to several recent applications in RL [19]. In this architecture, a switching logic switches from an advanced controller (the RL agent) to a verified base controller as soon as a state would be visited that is outside of the region where the base controller is guaranteed to satisfy an invariant safety property. In the general shielding setting, shielding supports safety specifications beyond invariant properties, including, for example, bounded liveness properties. Furthermore, compared to MPS and the simplex architecture, shielding interferes less with the agent. Instead of overwriting an unsafe action chosen by the agent with a particular safe one, a shield lets the agent choose any action as long as it is safe.
Fulton et al. [11] published the first work on verifiable safe learning for hybrid systems. In this work, a theorem prover is used to prove the correctness of a model with respect to a safety specification given in differential dynamic logic. This work was extended in [12] to the setting in which a single accurate model is not known at design time. The authors propose an approach in which multiple environmental models that are provably correct are constructed at design time. During runtime, based on the collected data, the approach selects between the available models. A major drawback of the approaches from [11,12] is that they only learn control policies over handcrafted symbolic state spaces. This limitation was addressed in [18]. Prior to RL, an agent is trained to detect positions of safety-critical objects from visual data. During RL, this information is then used to enforce formal safety constraints that take noise from the object detection systems into account.
In the discrete domain, the shielding approach is commonly used in safe RL [1,22]. In the deterministic setting, shields are usually constructed offline by computing a maximally permissive policy containing all actions that will not violate the safety specification on the infinite horizon. Jansen et al. [20] introduced offline shielding in probabilistic environments, considering safety within a finite horizon. Giacobbe et al. [14] used a very similar approach to shield safety properties in Atari games. Our work directly extends the shielding approach of Jansen et al. [20] to the online setting. Several recent extensions of shielding in probabilistic environments exist, such as shielding under partial observability [6], shielding quantitative properties [32], or shielding multi-agent systems [10]. In 2021, the shield synthesis tool TEMPEST was published, which is able to synthesise several different notions of shields proposed in the literature for probabilistic environments [31].
Novelty of our work. All model-based safe RL approaches discussed in this section rely on building the model (or several models) at design time and exhaustively analysing the safety of actions within the model at design time. Our approach builds the environmental models for a finite horizon at runtime and performs the safety verification online. This allows our approach to be applied to environments that are large, changing, or unknown at design time.
To make the safety verification at runtime possible, we are the first to compute the safety of actions before a decision state is visited, by considering all possible movements of adversarial agents in the computations. By analysing the actions before a decision state is reached, we prevent delays at runtime.

Preliminaries
Sequence and Tuple Notation. We denote sequences of elements by $t = e_0 \cdots e_n$, with $\epsilon$ denoting the empty sequence. The length of $t$ is denoted as $|t| = n + 1$. We use $t[i] = e_i$ for 0-based indexed access on tuples and sequences. The notation $t[i \leftarrow e'_i]$ represents overwriting of the $i$th element of $t$ by $e'_i$, that is, $t[i \leftarrow e'_i] = e_0 \cdots e_{i-1} \, e'_i \, e_{i+1} \cdots e_n$.
A probability distribution over a countable set $X$ is a function $\mu : X \to [0, 1]$ with $\sum_{x \in X} \mu(x) = 1$. We write $\mathit{Distr}(X)$ for the set of all distributions over $X$ and $\mathit{supp}(\mu) = \{x \in X \mid \mu(x) > 0\}$ for the support of $\mu$. A Markov decision process (MDP) $\mathcal{M} = (S, s_0, A, \mathcal{P})$ is a tuple with a finite set $S$ of states, a unique initial state $s_0 \in S$, a finite set $A = \{a_1, \ldots, a_n\}$ of actions, and a (partial) probabilistic transition function $\mathcal{P} : S \times A \to \mathit{Distr}(S)$, where $\mathcal{P}(s, a) = \bot$ denotes undefined behaviour. For all $s \in S$ the available actions are $A(s) = \{a \in A \mid \mathcal{P}(s, a) \neq \bot\}$.
Non-deterministic choices in an MDP are resolved by a so-called policy. For the properties considered in this paper, memoryless deterministic policies are sufficient [4]. These are functions $\pi : S \to A$ with $\pi(s) \in A(s)$. We denote the set of all memoryless deterministic policies of an MDP by $\Pi$. Applying a policy $\pi$ to an MDP yields an induced Markov chain $D = (S, s_0, P_\pi)$ with $P_\pi : S \to \mathit{Distr}(S)$ where all nondeterminism is resolved. A reward function $r : S \times A \to \mathbb{R}$ for an MDP assigns a reward to every state $s$ and action $a$.
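As an illustration of these definitions, the following sketch represents the partial transition function as a dictionary in which a missing key plays the role of ⊥. It assumes that the available actions A(s) are exactly those for which P(s, a) is defined; all names are illustrative.

```python
from typing import Dict, List, Tuple

State, Action = str, str
Distr = Dict[State, float]
# Partial transition function: a missing (state, action) key plays the role of "undefined".
Transitions = Dict[Tuple[State, Action], Distr]


def available_actions(P: Transitions, s: State) -> List[Action]:
    """A(s): the actions for which P(s, a) is defined."""
    return sorted(a for (s2, a) in P if s2 == s)


def induced_chain(P: Transitions, policy: Dict[State, Action]) -> Dict[State, Distr]:
    """Resolve all nondeterminism with a memoryless deterministic policy."""
    chain = {}
    for s, a in policy.items():
        if (s, a) not in P:
            raise ValueError(f"policy picks unavailable action {a!r} in {s!r}")
        chain[s] = P[(s, a)]
    return chain


P: Transitions = {
    ("s0", "a"): {"s0": 0.5, "s1": 0.5},
    ("s0", "b"): {"s1": 1.0},
    ("s1", "a"): {"s1": 1.0},
}
print(available_actions(P, "s0"))                 # ['a', 'b']
print(induced_chain(P, {"s0": "b", "s1": "a"}))   # {'s0': {'s1': 1.0}, 's1': {'s1': 1.0}}
```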
In formal methods, safety properties are often specified as linear temporal logic (LTL) formulas [30]. For an MDP $\mathcal{M}$, probabilistic model checking [21,24] employs value iteration or linear programming to compute, for all states and actions of the MDP, the probability of satisfying a safety property ϕ.
Specifically, we compute $\eta^{\max}_{\phi,\mathcal{M}} : S \to [0, 1]$ or $\eta^{\min}_{\phi,\mathcal{M}} : S \to [0, 1]$, which yields for all states the maximal (or minimal) probability over all possible policies to satisfy ϕ. For instance, for ϕ encoding to reach a set of states $T$, $\eta^{\max}_{\phi,\mathcal{M}}(s)$ is the maximal probability to "eventually" reach a state in $T$ from state $s \in S$.
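A minimal sketch of value iteration for such reachability probabilities is given below. It computes a step-bounded variant, the maximal probability of reaching a target set within k steps, and is meant as an illustration rather than as the algorithm of a particular model checker; the function name and the example MDP are hypothetical.

```python
from typing import Dict, Set, Tuple

Distr = Dict[str, float]
Transitions = Dict[Tuple[str, str], Distr]


def max_reach_within(P: Transitions, targets: Set[str], k: int) -> Dict[str, float]:
    """Maximal probability, over all policies, of reaching `targets` within k steps."""
    states = {s for (s, _) in P} | {t for d in P.values() for t in d} | targets
    value = {s: 1.0 if s in targets else 0.0 for s in states}
    for _ in range(k):
        new_value = {}
        for s in states:
            if s in targets:
                new_value[s] = 1.0
                continue
            # Expected value of each available action under the previous iterate.
            options = [sum(p * value[t] for t, p in dist.items())
                       for (s2, _), dist in P.items() if s2 == s]
            new_value[s] = max(options, default=0.0)
        value = new_value
    return value


P: Transitions = {
    ("s0", "try"): {"goal": 0.5, "s0": 0.5},
    ("s0", "quit"): {"sink": 1.0},
    ("sink", "stay"): {"sink": 1.0},
}
print(max_reach_within(P, {"goal"}, k=3)["s0"])  # 0.875
```

Replacing `max` by `min` yields the corresponding minimal probability.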

Setting
We consider a setting similar to [20], where one controllable agent, called the avatar, and multiple uncontrollable agents, called adversaries, operate within an arena. The arena is a compact, high-level description of the underlying model and captures the dynamics of the agents. Any information on rewards is neglected within the arena since it is not needed for safety computations.
From this arena, potential agent locations may be inferred. Within the arena, the agents perform tasks that are sequences of activities performed consecutively.
Formally, an arena is a pair $G = (V, E)$, where $V$ is a set of nodes and $E \subseteq V \times V$ is a finite set of edges. An agent's location is defined via the current node $v \in V$. An edge $(v, v') \in E$ represents an activity of an agent. By executing an activity, the agent moves to its next location $v'$. A task is defined as a non-empty sequence $(v_1, v_2) \cdot (v_2, v_3) \cdots (v_{n-1}, v_n) \in E^*$ of connected edges. To ease representation, we denote tasks also as sequences of locations $v_1 \cdot v_2 \cdots v_n$. The set of tasks available in a location $v \in V$ is given by the function $\mathit{Task}(v)$. The set of all tasks of an arena $G$ is denoted by $\mathit{Task}(G)$. The avatar is only able to select a next task at a decision location in $V_D \subseteq V$. To avoid deadlocks, we require for any decision location $v \in V_D$ that $\mathit{Task}(v) \neq \emptyset$ and that any task in $\mathit{Task}(v)$ ends in another decision location, from which the agent is able to decide on a new task. A safety property may describe that some combinations of agent positions are unsafe and should not be reached (or any other safety property from the safety fragment of LTL).
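The following sketch illustrates arenas and tasks on a small, made-up arena. The paper treats Task as a given function; the enumeration rule used here (follow edges through non-decision locations, without revisiting a location, until the next decision location is reached) is an assumption made for the example, as are all names.

```python
from typing import Dict, List, Set, Tuple

Node = str
Edge = Tuple[Node, Node]


def tasks_from(v: Node, edges: Set[Edge], decision: Set[Node]) -> List[List[Node]]:
    """Enumerate tasks starting in decision location v as location sequences.

    Assumption: a task follows edges through non-decision locations, without
    revisiting a location, until it reaches the next decision location.
    """
    succ: Dict[Node, List[Node]] = {}
    for a, b in sorted(edges):
        succ.setdefault(a, []).append(b)
    tasks, stack = [], [[v]]
    while stack:
        path = stack.pop()
        for nxt in succ.get(path[-1], []):
            if nxt in path:
                continue
            if nxt in decision:
                tasks.append(path + [nxt])
            else:
                stack.append(path + [nxt])
    return sorted(tasks)


def as_edges(task: List[Node]) -> List[Edge]:
    """Convert the location view of a task into its sequence of edges."""
    return list(zip(task, task[1:]))


corridors = [("d1", "c1"), ("c1", "c2"), ("c2", "d2"), ("d1", "c3"), ("c3", "d3")]
edges = set(corridors) | {(b, a) for a, b in corridors}  # corridors can be walked both ways
decision = {"d1", "d2", "d3"}
print(tasks_from("d1", edges, decision))  # [['d1', 'c1', 'c2', 'd2'], ['d1', 'c3', 'd3']]
print(as_edges(["d1", "c3", "d3"]))       # [('d1', 'c3'), ('c3', 'd3')]
```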
Example 1 (Gridworld). Figure 1 shows a simple gridworld with corridors represented by white tiles and walls represented by black tiles. A tile is defined via its $(x, y)$ position. We model this gridworld with an arena $G = (V, E)$ by associating each white tile with a location in $V$ and creating an edge in $E$ for each pair of adjacent white tiles. Corners and crossings are decision locations, i.e., $V_D = \{(1, 1), (1, 3), (1, 5), (5, 1), (5, 3), (5, 5)\}$. At each decision location, tasks define sequences of activities needed to traverse adjoining corridors, e.g., Task(

Problem Statement
Consider an environment described by an arena as above and a safety specification ϕ. We assume stochastic behaviours for the adversaries, e.g., obtained using RL [33,34]. In fact, this stochastic behaviour determines all actions of the adversaries via probabilities. The underlying model is then an MDP: the avatar executes an action, and upon this execution, the next exact positions (the state of the system) are determined stochastically. That is, the states correspond to the possible positions of all agents including the avatar and the actions correspond to the available tasks.
Our aim is to shield unsafe actions from the avatar during training as well as during execution. At any state s, an action a is called unsafe if executing a incurs a probability of violating ϕ within the next k steps greater than a threshold δ that is defined with respect to the optimal safety probability possible in s.
The safety analysis of actions is performed on the fly, allowing the avatar to operate within large arenas.
Example 2 (Gridworld). In Figure 1, the tile labelled A denotes the location of the avatar and the tile labelled E denotes the position of an adversary. Let $(x_A, y_A)$ and $(x_E, y_E)$ be the positions of the avatar and the adversary, respectively. A safety property in this scenario is $\phi = \mathbf{G}(\neg(x_A = x_E \wedge y_A = y_E))$. The "globally" operator $\mathbf{G}$ states that unsafe states must not be entered, i.e., that collisions are never allowed. The shield blocks unsafe actions that would increase the probability of violating ϕ within the next k steps by more than a relative threshold δ. We give more details in Section 4.3 on how to construct a shield for this setting.
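For illustration, the collision-freedom property of this example can be checked on a finite trace of positions as sketched below. The trace format and the function name are assumptions made for the sketch; a violation of a safety property is always witnessed by a finite prefix, which is why a finite check is meaningful here.

```python
from typing import Iterable, Tuple

Pos = Tuple[int, int]


def globally_no_collision(trace: Iterable[Tuple[Pos, Pos]]) -> bool:
    """Check G(not collision) on a finite trace of (avatar, adversary) positions."""
    return all(avatar != adversary for avatar, adversary in trace)


safe_trace = [((1, 1), (5, 5)), ((1, 2), (5, 4)), ((1, 3), (5, 3))]
unsafe_trace = safe_trace + [((3, 3), (3, 3))]
print(globally_no_collision(safe_trace), globally_no_collision(unsafe_trace))  # True False
```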

Online Shielding for MDPs
Figure 2 outlines the workflow of online shielding, which we describe in this section. Given an arena and behaviour models for adversaries, we define an MDP $\mathcal{M}$ that captures all safety-relevant information. At runtime, we use current runtime information to create sub-MDPs $\mathcal{M}'$ of $\mathcal{M}$ that model the immediate future of the agents up to some finite horizon. Given such a sub-MDP $\mathcal{M}'$ and a safety property ϕ, we compute via model checking the probability to violate ϕ within the finite horizon for each task available. The shield then blocks tasks whose risk is too large from the avatar. To ensure effectiveness, we choose the horizon large enough such that it covers the distance between any pair of adjacent decision locations, i.e., pairs of locations connected by a task.

Behaviour Models for Adversaries
The adversaries and the avatar operate within a shared environment, which is represented by an arena $G = (V, E)$, and perform tasks independently. We assume that we are given a stochastic behaviour model of each adversary that determines all task choices of the respective adversary via probabilities. The behaviour of an adversary is formally defined as follows.
Definition 1 (Adversary Behaviour). For an arena $G = (V, E)$, we define the behaviour $B_i$ of an adversary $i$ as a function $B_i : V_D \to \mathit{Distr}(\mathit{Task}(G))$ from decision locations to distributions over tasks, with $\mathit{supp}(B_i(v)) \subseteq \mathit{Task}(v)$.
Behaviour models of adversaries may be derived using domain knowledge or generalised from observations using machine learning or automata learning [27,37,38]. A potential approach is to observe adversaries in smaller arenas and transfer knowledge gained in this way to larger arenas [20]. Cooperative and truly adversarial behaviour of adversaries may require considering additional aspects in the adversary behaviour, such as the arena state at a specific point in time. Such considerations are beyond the scope of this paper, since complex adversary behaviour generally makes the creation of behaviour models more difficult, whereas the online shield computations are hardly affected. For example, history-dependent adversary behaviour would require a different definition of sub-MDPs $\mathcal{M}'$, but such behaviour would not affect the size of sub-MDPs. Hence, it would lead to similar computation times. The MDP size and the computation times generally depend on the number of decision locations, the horizon, and the number of adversaries.
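A behaviour model in the sense of Definition 1 can be represented as a nested dictionary. The sketch below checks that each entry is a probability distribution and, as an assumption about the intended support condition, that it only assigns positive probability to tasks available at the respective location; the names are illustrative.

```python
import math
from typing import Dict, Set, Tuple

Task = Tuple[str, ...]                      # a task as a sequence of locations
Behaviour = Dict[str, Dict[Task, float]]    # decision location -> distribution over tasks


def is_valid_behaviour(behaviour: Behaviour, task_fn: Dict[str, Set[Task]]) -> bool:
    """Check that every B_i(v) is a distribution supported on Task(v)."""
    for v, dist in behaviour.items():
        support = {t for t, p in dist.items() if p > 0}
        if not support <= task_fn.get(v, set()):
            return False  # positive probability for a task not available at v
        if any(p < 0 for p in dist.values()):
            return False
        if not math.isclose(sum(dist.values()), 1.0):
            return False
    return True


task_fn = {"d1": {("d1", "c", "d2"), ("d1", "e", "d3")}}
good = {"d1": {("d1", "c", "d2"): 0.25, ("d1", "e", "d3"): 0.75}}
bad = {"d1": {("d1", "x", "d9"): 1.0}}
print(is_valid_behaviour(good, task_fn), is_valid_behaviour(bad, task_fn))  # True False
```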

Safety-Relevant MDP M
In the following, we describe the safety-relevant MDP $\mathcal{M}$ underlying the agents operating within an arena. This MDP includes non-deterministic choices of the avatar and stochastic behaviour of the adversaries. Note that the safety-relevant MDP $\mathcal{M}$ is never explicitly created for online shielding, but is explored on the fly for the safety analysis of tasks. We follow the presentation by Jansen et al. [20] for this purpose.
Let $G = (V, E)$ be an arena, let $\mathit{Task}$ be a task function for $G$, let $B_i$ with $i \in \{1, \ldots, m\}$ be the behaviour functions of $m$ adversaries, and let the avatar be the zeroth agent. The safety-relevant MDP $\mathcal{M} = (S, s_0, A, \mathcal{P})$ models the arena and agents' dynamics as follows. Each agent has a position and a task queue containing the activities to be performed from the last chosen task. The agents take turns performing activities from their respective task queue. If the task queue of an agent is empty, a new task has to be selected. Since we control the avatar, we model its choice of tasks as a non-deterministic choice from all available tasks. Therefore, we analyse the outcomes of carrying out all possible tasks. By Definition 1, the adversary behaviour is probabilistic, i.e., the adversaries choose tasks according to a discrete probability distribution.
Hence, M has three types of states: (1) states where the avatar's task queue is empty and the avatar makes a non-deterministic decision on its next task, (2) states where an adversary's task queue is empty and the adversary selects its next task probabilistically, and (3) states where the currently active agent has a non-empty task queue and the agent processes its task queue deterministically.
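Formally, the states $S = V^{m+1} \times (E^*)^{m+1} \times \{0, \ldots, m\}$ are triples $s = (v, q, t)$ where $v$ encodes the agent positions, $q$ encodes the task queue states of all agents, and $t$ encodes whose turn it is. We write $\mathit{pos}(s)$, $\mathit{task}(s)$, and $\mathit{turn}(s)$ for these three components and $\mathit{ava} = 0$ for the index of the avatar. The decision states $S_D \subseteq S$ are the states $s$ with $\mathit{turn}(s) = \mathit{ava}$ and $\mathit{task}(s)[\mathit{ava}] = \epsilon$, i.e., the states in which the avatar has to select a new task. Since every task ends in a decision location, this implies that if $s_D \in S_D$, then $\mathit{pos}(s_D)[\mathit{ava}]$ is a decision location in $V_D$. A policy for $\mathcal{M}$ needs to define actions only for states in $S_D$, thereby defining the decisions for the avatar. All other task decisions in states $s$, where $\mathit{turn}(s) \neq \mathit{ava}$, are performed stochastically by adversaries and cannot be controlled.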
At run-time, in each turn each agent performs two steps: (1) If its task queue is empty, the agent has to select its next task and adds it to the task queue.
(2) The agent performs the next activity of its current task queue.
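The state structure and the second of the two steps can be sketched as follows. The sketch assumes that decision states are the states in which it is the avatar's turn and its task queue is empty, and that the turn passes round-robin to the next agent index after an activity; the class and function names are illustrative.

```python
from typing import NamedTuple, Tuple

Edge = Tuple[str, str]
AVA = 0  # the avatar is the zeroth agent


class State(NamedTuple):
    pos: Tuple[str, ...]                 # one location per agent
    task: Tuple[Tuple[Edge, ...], ...]   # one queue of pending edges per agent
    turn: int                            # index of the agent that moves next


def is_decision_state(s: State) -> bool:
    """True when it is the avatar's turn and its task queue is empty."""
    return s.turn == AVA and len(s.task[AVA]) == 0


def perform_activity(s: State) -> State:
    """Move the active agent along the first edge of its queue and pass the turn.

    Assumption: the turn passes round-robin to the next agent index.
    """
    i = s.turn
    (src, dst), rest = s.task[i][0], s.task[i][1:]
    if src != s.pos[i]:
        raise ValueError("queued edge does not start at the agent's location")
    pos = s.pos[:i] + (dst,) + s.pos[i + 1:]
    task = s.task[:i] + (rest,) + s.task[i + 1:]
    return State(pos, task, (i + 1) % len(s.pos))


s = State(pos=("a", "x"), task=((("a", "b"),), ()), turn=AVA)
print(perform_activity(s))  # avatar moved to 'b', its queue is empty, adversary's turn
```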

Selecting a New Task
A new task has to be selected in all states $s$ with $\mathit{turn}(s) = i$ and $\mathit{task}(s)[i] = \epsilon$, i.e., it is the turn of agent $i$ and agent $i$'s task queue is empty. If $i = \mathit{ava}$, the avatar is in a decision state $s \in S_D$, with actions $A(s) = \mathit{Task}(\mathit{pos}(s)[\mathit{ava}])$. For each task $t \in A(s)$, there is a successor state $s'$ with $\mathit{task}(s') = \mathit{task}(s)[\mathit{ava} \leftarrow t]$, $\mathit{pos}(s') = \mathit{pos}(s)$, $\mathit{turn}(s') = \mathit{turn}(s)$, and $\mathcal{P}(s, t, s') = 1$. Thus, there is a transition that updates the avatar's task queue with the edges of task $t$ with probability one. Other than that, there are no changes.
If $i \neq \mathit{ava}$, an adversary makes a decision, thus $A(s) = \{\alpha_{\mathit{adv}}\}$. For each $t \in \mathit{Task}(\mathit{pos}(s)[i])$, there is a state $s'$ with $\mathit{task}(s') = \mathit{task}(s)[i \leftarrow t]$, $\mathit{pos}(s') = \mathit{pos}(s)$, $\mathit{turn}(s') = \mathit{turn}(s)$, and $\mathcal{P}(s, \alpha_{\mathit{adv}}, s') = B_i(\mathit{pos}(s)[i])(t)$. There is a single action with a stochastic outcome determined according to the adversary behaviour $B_i$.
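The two cases of task selection can be sketched in code. The sketch assumes the state layout used in the text (positions, task queues, turn) and that only adversaries use the single action, named `ADV_ACTION` here; the data layout and all names are illustrative.

```python
from typing import Dict, NamedTuple, Tuple

Edge = Tuple[str, str]
Queue = Tuple[Edge, ...]
AVA = 0
ADV_ACTION = "alpha_adv"  # the single action available when an adversary chooses


class State(NamedTuple):
    pos: Tuple[str, ...]
    task: Tuple[Queue, ...]
    turn: int


def task_selection(s: State,
                   task_fn: Dict[str, Tuple[Queue, ...]],
                   behaviour: Dict[int, Dict[str, Dict[Queue, float]]]
                   ) -> Dict[object, Dict[State, float]]:
    """Transitions out of a state whose active agent has an empty task queue."""
    i = s.turn
    if s.task[i] != ():
        raise ValueError("task selection requires an empty task queue")

    def with_queue(t: Queue) -> State:
        # Only the task queue of agent i changes; positions and turn stay the same.
        return s._replace(task=s.task[:i] + (t,) + s.task[i + 1:])

    if i == AVA:
        # One action per available task, each with a single successor.
        return {t: {with_queue(t): 1.0} for t in task_fn[s.pos[AVA]]}
    # One action whose outcome follows the adversary's behaviour model.
    dist = behaviour[i][s.pos[i]]
    return {ADV_ACTION: {with_queue(t): p for t, p in dist.items() if p > 0}}


q1: Queue = (("d2", "c"), ("c", "d1"))
a1: Queue = (("d1", "c"), ("c", "d2"))
a2: Queue = (("d1", "e"), ("e", "d3"))
task_fn = {"d2": (q1,), "d1": (a1, a2)}
behaviour = {1: {"d1": {a1: 0.25, a2: 0.75}}}

avatar_turn = State(pos=("d2", "d1"), task=((), ()), turn=AVA)
adversary_turn = State(pos=("d2", "d1"), task=(q1, ()), turn=1)
print(task_selection(avatar_turn, task_fn, behaviour))
print(task_selection(adversary_turn, task_fn, behaviour))
```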

Performing Activities
After potentially selecting a new task, the task queue of agent $i$ is non-empty. We are in a state $s'$, where task(s

Sub-MDP M for Next Decision
The idea of online shielding is to compute the safety value of actions in the decision states on the fly and block actions that are too risky. For infinite horizon properties, the probability to violate safety in the long run is often one, and errors stemming from modelling uncertainties may sum up over time [20]. Therefore, we consider safety relative to a finite horizon such that the action values (and consequently, a policy for the avatar) carry guarantees for the next several steps. Explicitly constructing an MDP $\mathcal{M}$ as outlined above yields a very large number of decision states that may be infeasible to check. The finite horizon assumption allows us to prune the safety-relevant MDP and construct small sub-MDPs $\mathcal{M}'$ capturing the immediate future of individual decision states.
More concretely, we consider runtime situations of being in a state $s_t$, the state visited immediately after the avatar decided to perform a task $t$. In such situations, we can use the time required to perform $t$ for shield computations for the next decision. We create a sub-MDP $\mathcal{M}'$ by determining all states reachable within a finite horizon and use $\mathcal{M}'$ to check the safety probability of each action (task) available in the next decision and block unsafe actions.

Construction of M
Online shielding relies on the insight that after deciding on a task t, the time required to complete t can be used to compute a shield for the next decision.Thus, we start the construction of the sub-MDP M for the next decision location v D from the state s t that immediately follows a decision state s D , where the avatar has chosen a task t ∈ A(s D ).The MDP M is computed with respect to a finite horizon h for v D .
By construction, the task is of the form t = v D • • • v D , where v D is the avatar's current location and v D is the next decision location.While the avatar performs t to reach v D , the adversaries perform arbitrary tasks and traverse |t| edges, i.e., until v D is reached only adversaries make decisions.This leads to a set of possible next decision states.We call these states the first decision states S FD ⊆ S D .After reaching v D , both avatar and adversaries decide on arbitrary tasks and all agents traverse h edges.This behaviour defines the structure of M .
Let a safety-relevant MDP $M = (S, s_0, A, P)$, a decision state $s_D$ and its successor $s_t$ with $\mathrm{task}(s_t)[\mathrm{ava}] = t$, and a finite horizon $h \in \mathbb{N}$ be given, where $h$ represents a number of turns taken by all agents following the next decision. These turns and the (stochastic) agent behaviour leading to the next decision are modelled by the sub-MDP $M'$. $M' = (S', s'_0, A', P')$ is formally constructed as follows. The actions are the same as for $M$, i.e., $A' = A$. The initial state is given by $s'_0 = (s_t, 0)$. The states of $M'$ are a subset of $M$'s states augmented with the distance from $s'_0$, i.e., $S' \subseteq S \times \mathbb{N}_0$. The distance is measured in terms of the number of turns taken by all agents.
We define transitions and states inductively by: (1) [...]
Movements of the last of $m + 1$ agents increase the distance from the initial state. Combined with the fact that every movement action increases the agent index and every decision changes a task queue, we can infer that the structure of $M'$ is a directed acyclic graph. This enables an efficient probabilistic analysis.
By construction, it holds that for every state $(s, d) \in S'$ with $d < |t|$, $s$ is not a decision state of $M$. The set of first decision states $S_{FD}$ consists of all states $s_{FD} = (s, |t|)$ such that $s_{FD} \in S'$ with $\mathrm{task}(s)[\mathrm{ava}] = \epsilon$ (the avatar's task queue is empty) and $\mathrm{turn}(s) = \mathrm{ava}$, i.e., all first decision states reachable from the initial state of $M'$. We use $\mathrm{Task}(S_{FD}) = \{t' \mid s \in S_{FD}, t' \in A(s)\}$ to denote the tasks available in these states. $M'$ does not define actions and transitions from states $(s, |t| + h) \in S'$, as their successor states are beyond the considered horizon $h$. We have $A'((s, d)) \neq \emptyset$ for all states at distance $d < |t| + h$ from the initial state.
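The layered unrolling described above can be sketched as a breadth-first exploration that tags every state with its distance from the initial state. This is a minimal illustration under stated assumptions, not the authors' implementation: the callback `successors` and the flag `turn_done` (true when the last agent has moved, so the distance increases) are hypothetical names, and `depth_limit` plays the role of $|t| + h$.

```python
from collections import deque


def build_sub_mdp(s_t, successors, depth_limit):
    """Unroll the safety-relevant MDP from s_t into nodes (state, distance).

    `successors(state)` is a hypothetical callback returning a dict
    action -> list of (probability, next_state, turn_done).
    Nodes at distance `depth_limit` get no outgoing actions.
    """
    start = (s_t, 0)
    transitions = {}            # (state, d) -> {action: [(prob, (state', d'))]}
    queue = deque([start])
    seen = {start}
    while queue:
        state, d = queue.popleft()
        if d == depth_limit:    # beyond the horizon: no actions defined
            transitions[(state, d)] = {}
            continue
        outgoing = {}
        for action, branches in successors(state).items():
            outgoing[action] = []
            for prob, nxt, turn_done in branches:
                node = (nxt, d + 1 if turn_done else d)
                outgoing[action].append((prob, node))
                if node not in seen:
                    seen.add(node)
                    queue.append(node)
        transitions[(state, d)] = outgoing
    return start, transitions
```

Because a node is only ever connected to nodes at the same or a larger distance, and the exploration stops at `depth_limit`, the number of explored nodes depends on the horizon rather than on the size of the full MDP.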

Shield Construction
The probability of reaching a set of unsafe states $T \subseteq S$ from any state in the safety-relevant MDP should be low. In the finite horizon setting, we are interested in bounded reachability from decision states $s_D \in S_D$ within the finite horizon $h$. We concretely evaluate reachability on sub-MDPs $M'$ and use $T' = \{(s, d) \in S' \mid s \in T\}$ to denote the unsafe states that may be reached within the horizon covered by $M'$. The property $\varphi = \Diamond T'$ encodes the violation of the safety constraint, i.e., eventually reaching $T'$ within $M'$. The shield needs to limit the probability to satisfy $\varphi$.
Let a sub-MDP $M'$ and a set of first decision states $S_{FD}$ be given. For each task $t \in \mathrm{Task}(S_{FD})$, we evaluate $t$ with respect to the minimal probability to satisfy $\varphi$ from the initial state $s'_0$ when executing $t$, which is given by $\eta^{\min}_{\varphi, M'}(s'_0)$. This is formalised with the notion of task-valuations below.

Definition 3 (Task-valuation).
A task-valuation for a task $t$ in a sub-MDP $M'$ with initial state $s'_0$ and first decision states $S_{FD}$ is given by $\mathrm{val}_{M'} : \mathrm{Task}(S_{FD}) \to [0, 1]$ with $\mathrm{val}_{M'}(t) = \eta^{\min}_{\varphi, M'}(s'_0)$, and $A'(s_{FD}) = \{t\}$ for each $s_{FD} \in S_{FD}$.
The optimal task-value for $M'$ is $\mathrm{optval}_{M'} = \min_{t' \in \mathrm{Task}(S_{FD})} \mathrm{val}_{M'}(t')$.
A task-valuation is the minimal probability to reach an unsafe state in $T'$ from each immediately reachable decision state $s_{FD} \in S_{FD}$, weighted by the probability to reach $s_{FD}$. When the avatar chooses an optimal task $t$ (with $\mathrm{val}_{M'}(t) = \mathrm{optval}_{M'}$) as next task in a state $s_{FD}$, $\mathrm{optval}_{M'}$ can be achieved if all subsequent decisions are optimal as well.
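To make the task-valuation concrete, the following sketch evaluates it by memoised backward induction on an acyclic sub-MDP. It is an illustration only; the paper's implementation uses a probabilistic model checker. The data layout (`transitions` mapping a node to a dict of actions with `(probability, successor)` branches) and the predicate `is_unsafe` are assumptions made for this example, and all first decision states are assumed to offer the same tasks.

```python
def task_valuations(start, transitions, is_unsafe, first_decision_states):
    """Return {task: minimal probability of reaching an unsafe state}.

    In the first decision states only the evaluated task is allowed;
    in all later decision states the safest action is chosen (min).
    """
    tasks = set()
    for node in first_decision_states:
        tasks.update(transitions[node])

    def valuation(task):
        memo = {}

        def prob(node):
            if node in memo:
                return memo[node]
            if is_unsafe(node[0]):
                result = 1.0
            else:
                actions = transitions.get(node, {})
                if node in first_decision_states:
                    actions = {task: actions[task]}     # force the task
                # Nodes without actions lie at the horizon and count as safe.
                result = min(
                    (sum(p * prob(nxt) for p, nxt in branches)
                     for branches in actions.values()),
                    default=0.0,
                )
            memo[node] = result
            return result

        return prob(start)

    return {task: valuation(task) for task in tasks}
```

In this sketch, states with a single stochastic action (adversary decisions and movements) contribute an expectation, while states with several actions contribute a minimum, which mirrors the optimistic assumption that later avatar decisions are as safe as possible.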
We now define a shield for the decision states $S_{FD}$ in a sub-MDP $M'$ using the task-valuations. Specifically, a shield for a threshold $\delta \in [0, 1]$ determines a set of tasks available in $S_{FD}$ that are $\delta$-optimal for the specification $\varphi$. All other tasks are "shielded" or "blocked".

Definition 4 (Shield).
For a task-valuation $\mathrm{val}_{M'}$ and a threshold $\delta \in [0, 1]$, a shield for $S_{FD}$ in $M'$ is given by $\mathrm{shield}^{\delta}_{M'} = \{t \in \mathrm{Task}(S_{FD}) \mid \delta \cdot \mathrm{val}_{M'}(t) \leq \mathrm{optval}_{M'}\}$.
Intuitively, $\delta$ enforces a constraint on tasks that are acceptable w.r.t. the optimal probability. The shield is adaptive with respect to $\delta$, as a high value for $\delta$ yields a stricter shield, a smaller value a more permissive shield. In particularly critical situations, the shield can force the decision maker to resort to (only) the optimal actions w.r.t. the safety objective. This can be achieved by temporarily setting $\delta = 1$. Online shielding creates shields on-the-fly by constructing sub-MDPs $M'$ and computing task-valuations for all available tasks.
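As a small illustration of the relative threshold, the sketch below filters tasks by comparing each valuation against the optimal one. It assumes the rule "allow $t$ if $\delta \cdot \mathrm{val}(t) \leq \mathrm{optval}$", which matches the behaviour described above ($\delta = 1$ keeps only optimal tasks, smaller $\delta$ is more permissive); the function name and the dictionary layout are hypothetical.

```python
def relative_shield(valuations, delta):
    """Return the set of tasks allowed under a relative threshold delta.

    `valuations` maps each task to its probability of violating safety.
    The safest task always satisfies the condition, so the result is
    never empty for a non-empty input.
    """
    optval = min(valuations.values())
    return {task for task, val in valuations.items() if delta * val <= optval}
```

For example, with valuations 0.5, 0.3 and 0.25, setting `delta=1.0` keeps only the task with valuation 0.25, whereas `delta=0.0` allows all three.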
Through online shielding, we transform the safety-relevant MDP $M$ into a shielded MDP with which the avatar interacts. The shielded MDP is never explicitly created; it is obtained from the composition of all sub-MDPs $M'$. Due to the assumption on the task functions that requires a non-empty set of available tasks in all decision locations and due to the fact that every decision for shielding is defined w.r.t. an optimal task, the shielded MDP is deadlock-free. The deadlock-freedom follows from using a relative threshold ensuring that the safest action never gets blocked, thus at least one action is always available. Using task valuations as a basis, our notion of online shielding guarantees optimality with respect to safety.
By using the minimal probability as task valuation $\mathrm{val}_{M'}(t)$, we assume that the avatar performs optimally with respect to safety in upcoming decisions. This means that to compute task valuations we assume that the agent always chooses the safest action that is available. Alternative definitions that are less optimistic would be possible as well. Defining task valuations as the maximal probability to violate safety would yield stricter guarantees. The corresponding shield would block each action $a$ if there exists any future behaviour following $a$ that violates safety with a too large probability.
Instead of a relative threshold, we may use a fixed absolute threshold $\lambda \in [0, 1]$ such that only tasks $t$ with $\mathrm{val}_{M'}(t) \leq \lambda$ are allowed. Since this may induce deadlocks in case there are no sufficiently safe actions, we fall back to shielding with $\delta = 1$, which defines the threshold as the valuation of the safest action. That is, if there are no $\lambda$-safe actions, we use the safest action available. Hence, we either enforce a limit on unsafe behaviour or use the safest option available, which avoids deadlocks.
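The absolute threshold with its deadlock-avoiding fallback can be sketched as follows. The function name and the dictionary layout are hypothetical; the fallback returns the tasks whose valuation equals the minimum, which is how shielding with $\delta = 1$ is described above.

```python
def absolute_shield(valuations, lam):
    """Allow tasks with violation probability at most lam.

    If no task is lam-safe, fall back to the safest available task(s),
    so that at least one task is always allowed.
    """
    allowed = {task for task, val in valuations.items() if val <= lam}
    if allowed:
        return allowed
    optval = min(valuations.values())
    return {task for task, val in valuations.items() if val == optval}
```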

Optimisation - Updating Shields after Adversary Decisions
After the avatar decides on a task, we use the time to complete the task to compute shields based on task-valuations (see Definition 3 and Definition 4). Such shield computations are inherently affected by uncertainties stemming from stochastic adversary behaviour. These uncertainties consequently decrease whenever we observe a concrete decision from an adversary that we considered stochastic in the initial shield computation. An optimisation of the online shielding approach is to compute a new shield after any decision of an adversary, if there is enough remaining time until the next decision location. Suppose that after visiting a decision state, we computed a shield based on $M'$. While moving to the next decision state, an adversary decides on a new task and we observe the concrete state $s$. We can now construct a new sub-MDP $M''$ using $s''_0 = s$ as initial state, thereby resolving a stochastic decision between the original initial state $s'_0$ and $s''_0$. Using $M''$, we compute a new shield for the next decision location.
The facts that the probabilistic transition function of $M$ does not change during updates and that we consider safety properties enable a very efficient implementation of updates. For instance, if value iteration is used to compute task-valuations, we can simply change the initial state and reuse computations from the initial shield computation. Note that if a task is completely safe, i.e., it has a valuation of zero, the value of this task will not change under a re-computation, since the task is safe under any sequence of adversary decisions.
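One way to picture the reuse of computations is to keep the value of every node of the sub-MDP, not only the value of the initial state. After an adversary decision is observed, the updated valuation is then a table lookup at the observed node. The sketch below assumes the observed state already occurs as a node of the original sub-MDP and uses a hypothetical data layout (`transitions` maps a node to a dict of actions with `(probability, successor)` branches).

```python
def all_node_values(transitions, is_unsafe, first_decision_states, task):
    """Minimal violation probability for every node, with `task` forced
    in the first decision states. The returned table can be queried at a
    new initial node without recomputation."""
    values = {}

    def prob(node):
        if node not in values:
            if is_unsafe(node[0]):
                values[node] = 1.0
            else:
                actions = transitions.get(node, {})
                if node in first_decision_states:
                    actions = {task: actions[task]}
                values[node] = min(
                    (sum(p * prob(nxt) for p, nxt in branches)
                     for branches in actions.values()),
                    default=0.0,      # horizon reached without violation
                )
        return values[node]

    for node in transitions:
        prob(node)
    return values
```

Querying `values[start]` gives the initial valuation; querying `values[observed_node]` after an adversary decision gives the updated one.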

Implementation and Experiments
We consider several experiments of shielded reinforcement learning for a 2-player version of the classic game Snake. We picked this game because it requires fast decision making during runtime and provides an intuitive and fun setting to show the potential of shielding, so that it could also be used for teaching formal methods. Furthermore, the game is interesting for shielding, since the agent has to experience risky situations in order to win the game. This allows us to evaluate the ability of the shield to protect the agent from entering unsafe states, to study the influence of the shield on the learning performance, and to evaluate the safety of the final learned policy.
In this section, we start by giving details on the computation of the shield. In this context, we examine the runtime of shield computations. Afterwards, we discuss several experiments of shielding learning agents in tabular Q-learning and deep Q-learning.
The source code can be found at http://onlineshielding.at along with videos, evaluation data, and a Docker image that enables easy experimentation.

2-Player Snake
In the game of 2-player Snake, each player controls a snake of a different colour. A player wins if it either eats all randomly positioned apples of its own colour before the adversary snake collects all apples of its colour, or if it cuts off the other player. If the heads of both snakes collide, the game ends in a tie.
We provide an open-source implementation of the 2-player Snake game. The game can be played on 6 different maps, one of which is shown in Figure 3. The game settings allow varying the lengths of the snakes and their speed, allowing the user to set different levels of difficulty. For both snakes, shielding can be activated and deactivated. The game can be played in the following player modes: (1) double player (two human players compete), (2) player against agent (human player plays against a trained learning agent), or (3) agent against agent. The third mode is used for reinforcement learning. One snake is controlled by the avatar (the RL agent), and the adversary snake is controlled by a trained agent.
Implementation. The game's interface and logic were implemented using the pygame library.

Shield Computation
The task of the shield is to protect its snake from collisions with the adversary snake and with its own body.
In the safety-relevant MDP $M$, the avatar snake can make a decision at every crossing, thus the crossings define the decision locations. The states in $M$ store the positions of the bodies of both snakes. Therefore, we store for each snake the location of the head, the tail, and all crossings that are covered by the body of the snake. The locations of the corridors covered by the bodies are then implicitly defined. The safety specification $\varphi$ defines that the heads of the snakes should not collide and that the head of the avatar snake should not crash into the body of the adversary snake or its own body. This results in a safety specification $\varphi = \mathbf{G}(\neg \mathit{Collision\_Heads}(\mathit{head\_s1}, \mathit{head\_s2}) \wedge \neg \mathit{Collision\_Bodies}(\mathit{head\_s1}, \{\mathit{body\_s1}\}, \mathit{tail\_s1}, \{\mathit{body\_s2}\}, \mathit{tail\_s2}))$, where the predicates $\mathit{Collision\_Bodies}$ and $\mathit{Collision\_Heads}$ compare the locations of the snakes and check for collisions. Each time the avatar snake enters a corridor, the shielding approach creates sub-MDPs $M'$ for the next crossing and possible future positions for the adversary snake. Given such a sub-MDP $M'$ and the safety property $\varphi$, we compute the minimal probability to violate $\varphi$ within the next $h$ steps. When the avatar snake arrives at the next crossing, the shield allows only the corridors with the highest safety value. The game, as shown in Figure 3, indicates the risk of taking a corridor from low to high by the colours green, yellow, orange, red.
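The two collision predicates can be illustrated with a simplified state representation. In contrast to the compact encoding described above (head, tail, and covered crossings), this sketch stores every occupied cell explicitly; the type `Snake` and the function names are hypothetical and only serve to show what the predicates check.

```python
from typing import FrozenSet, NamedTuple, Tuple

Cell = Tuple[int, int]


class Snake(NamedTuple):
    head: Cell
    body: FrozenSet[Cell]   # every occupied cell except the head (tail included)


def collision_heads(avatar: Snake, adversary: Snake) -> bool:
    """True if both heads occupy the same cell."""
    return avatar.head == adversary.head


def collision_bodies(avatar: Snake, adversary: Snake) -> bool:
    """True if the avatar's head hits its own body or the adversary's body."""
    return avatar.head in avatar.body or avatar.head in adversary.body


def is_safe(avatar: Snake, adversary: Snake) -> bool:
    """State-level check of the invariant inside the G(...) specification."""
    return not collision_heads(avatar, adversary) and not collision_bodies(avatar, adversary)
```

A state is unsafe exactly when `is_safe` returns `False`, so the negation of this function can serve as the unsafe-state predicate in a bounded reachability analysis.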
We also implemented the optimisation to recalculate the shield after a decision of the adversary snake. Figure 4 contains two screenshots of the game on a simple map to demonstrate the effect of a shield update. In the left figure, the available tasks of the green snake are picking the corridor to the left or the corridor to the right. Both choices induce a risk of a collision with the purple snake. After the decision of the purple snake to take the corridor to its right-hand side, the shield is updated and the safety values of the corridors change.
Experimental Set-up. The shield computation uses the probabilistic model checker Storm [9] and its Python interface to compute the safety of actions. We use the PRISM [25] language to represent MDPs and domain-specific optimisations to efficiently encode agents and tasks, that is, snakes and their movements. The experiments with tabular Q-learning have been performed on a computer with an Intel® Core™ i7-4700MQ CPU with 2.4 GHz, 8 cores and 16 GB RAM. All experiments with deep Q-learning have been performed on a computer with an Intel® Core™ i5-6600 CPU.
Runtime Measurements. We report the time required to compute and analyse the safety of actions (i.e., to compute the shield) relative to the computation horizon. The experiments on computation time indicate how many steps shielding can look ahead within some given time.
When playing the game on Map 1 illustrated in Figure 3, we measured the time to compute shields, i.e., the time to construct sub-MDPs $M'$ and to compute the safety values. We measured the time of 200 such shield computations and report the maximum computation times and the mean computation times. Figure 5 presents the results for two different snake lengths l ∈ {10, 15} and different computation horizons h ∈ {10, 11, ..., 29}. The x-axis displays the computation horizon h and the y-axis displays the computation time in seconds in logarithmic scale.
We can observe that up to a horizon h of 17, all computations take less than one second, even in the worst case. Assuming that every task takes at least one second, we can plan ahead by taking into account safety hazards within the next 17 steps. A computation horizon of 20 still requires less than one second on average and about 3 seconds in the worst case. Horizons in this range are often sufficient, as we demonstrate in the next experiment by using h = 15.
We compare our timing results with a similar case study presented by Jansen et al. [20]. In a similar multi-agent setting on a comparably large map, the decisions of the avatar were shielded using an offline shield with a finite horizon of 10. The computation time to compute the offline shield was about 6 hours on a standard notebook. Note that although the setting has four adversaries, the offline computation was performed for one adversary and the results were combined for several adversaries online.
Furthermore, Figure 5 shows that the snake length affects the computation time only slightly. This observation supports our claim that online shielding scales well to large arenas, i.e., scenarios where the safety-relevant MDP $M$ is large. Note that the number of game configurations grows exponentially with the snake length (assuming a sufficiently large map), as the snake's tail may bend in different directions at each crossing.
The experiments further show that the computation time grows exponentially with the horizon. Horizons close to 30 may be advantageous in especially safety-critical settings, such as factories with industrial robots acting as agents. Since individual tasks in a factory may take minutes, online shielding would be feasible, as worst-case computation times are in the range of minutes. However, offline shielding would be infeasible due to the average computation time of more than 10 seconds that would be required for all decision states, of which there are thousands. As a result, computing an offline shield would require days and a large shielding database.
Complexity Analysis. To analyse the complexity of the constructed sub-MDPs, let $h$ be the horizon, let $n_a \leq h$ ($n_e \leq h$) be the number of decision states reachable within $h$ steps by the avatar (adversary), and let $l$ be the length of the snakes. In the worst case, the avatar snake can bend at $n_a$ points. Therefore, the number of reachable states for the avatar snake within a horizon of $h$ is at most $h \cdot l^{n_a}$. The same holds for the adversary snake, resulting in a state space in $O(h \cdot (l^{n_a} \cdot l^{n_e}))$. Note that adding further adversary snakes adds additional factors $l^{n}$.
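To get a feeling for the bound, the following helper evaluates $h \cdot l^{n_a} \cdot l^{n_e}$ for given parameters. The parameter values used in the note below are purely hypothetical and are not taken from the experiments.

```python
def state_space_bound(h: int, l: int, n_a: int, n_e: int) -> int:
    """Worst-case bound h * l**n_a * l**n_e on the sub-MDP state space
    for one avatar and one adversary snake."""
    return h * l ** n_a * l ** n_e
```

For instance, with the hypothetical values h = 15, l = 10 and n_a = n_e = 3, the bound evaluates to 15 · 10³ · 10³ = 15,000,000 states. This is a worst-case bound; the sub-MDPs actually constructed may be far smaller.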

Shielding for Tabular Q-learning
Next, we study the effects of shielding on a simple approximate Q-learning agent [36], called Agent 1, that we train on Map 1 (Figure 3).
Learning parameters. The feature vector of the approximate Q-learning agent denotes the distance to the next apple. The Q-learning uses the learning rate α = 0.1 and the discount factor γ = 0.5 for the Q-update and an ε-greedy exploration policy with ε = 0.6. The reward function of the RL agent is positively affected by collecting an apple in its colour (+10) and by wins of the avatar (+50), i.e., if it collects all apples in its own colour before the adversary snake collects all apples in the colour of the adversary. The agent receives a reward of -100 for losing the game.
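The learning set-up can be sketched as approximate Q-learning with a linear function approximator, using the stated parameters α = 0.1, γ = 0.5 and ε = 0.6. The linear form, the callback `feature_fn`, and the restriction of both exploration and the bootstrap target to shield-allowed actions are assumptions made for this illustration, not details confirmed by the text.

```python
import random

ALPHA, GAMMA, EPSILON = 0.1, 0.5, 0.6


def q_value(weights, features):
    """Linear approximation Q(s, a) = w . f(s, a)."""
    return sum(w * f for w, f in zip(weights, features))


def choose_action(weights, feature_fn, state, allowed, rng=random):
    """Epsilon-greedy choice among the actions the shield allows."""
    allowed = list(allowed)
    if rng.random() < EPSILON:
        return rng.choice(allowed)
    return max(allowed, key=lambda a: q_value(weights, feature_fn(state, a)))


def update(weights, feature_fn, state, action, reward, next_state, next_allowed):
    """One approximate Q-learning step; returns the new weight vector."""
    best_next = max(
        (q_value(weights, feature_fn(next_state, a)) for a in next_allowed),
        default=0.0,            # terminal state: no bootstrap term
    )
    features = feature_fn(state, action)
    td_error = reward + GAMMA * best_next - q_value(weights, features)
    return [w + ALPHA * td_error * f for w, f in zip(weights, features)]
```

In a shielded training loop, `allowed` would be the set of tasks returned by the shield at the current crossing, so blocked tasks are never selected.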
Results. To demonstrate the effects of shielding, we report the performance of shielded RL compared to unshielded RL during training measured in terms of gained reward.
Figure 6 shows plots of the reward gained during learning in the shielded and the unshielded case. The online shield uses a horizon of h = 15. The y-axis displays the reward and the x-axis displays the learning episodes, where one episode corresponds to one play of the Snake game. The reward has been averaged over 50 episodes for each data point. The plot demonstrates that shielding improves the gained reward significantly. By blocking unsafe actions, the avatar did not encounter a single loss due to a collision. For this reason, we see a consistently high reward right from the start of the learning phase. To evaluate the performance of the learned policies, we executed 100 games with the policies obtained from unshielded RL and shielded RL, keeping the shield in place. The shielded RL agent won 96 of the 100 games, whereas the unshielded RL agent won only 54.

Shielding for Deep Q-Learning
In the next experiments, we extend the simple tabular Q-learning setting to deep Q-learning. The goal is to study the effects of shielding on the learning performance and the safety of the final policy when the shield actively alters the reward function of the agent.
We call a shield an informed shield if it provides the shielded RL agent with knowledge about safety constraints. Whenever the agent enters a state in which the shield blocks actions, an informed shield sends a negative reward to the learning agent for each blocked action. RL with informed shields basically explores unsafe actions along safe execution paths instead of ignoring them completely. In contrast to unshielded RL, unsafe actions are explored for just a single step and penalised immediately.
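The effect of an informed shield on the learner can be illustrated in tabular form. The experiments use a deep Q-network, so the dictionary-based table, the function name, and the one-step update that treats the penalty as the complete return of a blocked action are simplifying assumptions for this sketch.

```python
from collections import defaultdict


def informed_update(q, state, blocked_actions, penalty=-100.0, alpha=0.1):
    """Move the Q-value of every blocked action towards the penalty.

    The blocked actions are never executed; they only receive this
    one-step negative update so the learner registers them as unsafe.
    """
    for action in blocked_actions:
        q[(state, action)] += alpha * (penalty - q[(state, action)])
```

With `q = defaultdict(float)`, each call lowers the value of the blocked actions in the given state while leaving the values of all allowed actions untouched.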
We explore the effect of informed shields for two learning agents trained with different reward structures: one that is conservative and focuses on eating apples, and a second one that encourages reckless behaviour that brings the agent into highly risky situations, in order to try cutting off the adversary snake.
Experimental Set-up. The game implementation, environment, and shield computation stay the same as in the previous experiment. The RL agent is implemented in PyTorch [29] and is trained for 270,000 steps with an Adam optimiser. The neural network approximating the Q-function consists of three 2-D convolution layers, each followed by a batch normalisation layer. The output of the last 2-D convolution layer feeds into two stages of linear layers. In addition to the outputs from the previous layers, the second linear layer receives the locations of the snakes and apples as additional input. In order to reduce the complexity of the state, the input to the network is split into four channels containing information about the map, the player snake, the enemy snake, and the apples.

Shielding a Conservative Agent
In our next experiments, we considered an agent called Agent 2, trained on Map 2 (Figure 7 (left)). The agent receives more reward for eating apples than the tabular Q-learning agent, but receives no reward for winning the game through cutting off the adversary snake. The resulting agent avoids getting close to the adversary snake and focuses on collecting apples. In detail, the reward function for Agent 2 is defined as follows: the agent receives +30 for eating an apple, +100 for winning by eating apples, and -100 if the game is lost or tied. When using informed shields, the agent receives a reward of -100 from the shield for each blocked action.
Results. Training. Figure 7 (right) shows the obtained reward for Agent 2 during training in four different settings: unshielded learning (blue line, no shield), shielded learning (orange line, shielded with λ = 0.01), and learning with informed shields using different absolute thresholds (green line, informed shield: threshold λ = 0.01; pink line, informed shield: λ = 0.5). The learning agents are trained for 270,000 steps. We used training steps instead of episodes (games) for comparison, since especially in the beginning, the episodes in shielded learning are longer than in unshielded learning. This results from the shields preventing losing the game. Using the episodes on the x-axis would therefore favour the shielding approaches. The rewards at each time step are averaged over 1000 and 100 games, plotted in dark colours and light colours, respectively.
The graph shows that all shielded learning settings outperform unshielded learning. Additionally, the informed learning setting with λ = 0.5 achieved slightly better rewards than the uninformed shielding setting. In the informed setting, the learning agent updated its Q-values also for actions blocked by the shield with negative rewards instead of only performing updates for the executed actions. Even though the agents would explore an unsafe action neither in the informed learning setting nor in the standard shielding setting, the information about the unsafety of actions helped the learning agent to increase its learning performance. Note that the agents shielded with λ = 0.01 performed slightly worse than the agent shielded with λ = 0.5. A threshold of λ = 0.01 yields a very strict shield that forbids any action that is slightly risky. Thus, the shield prevents the agent from eating apples whenever there is a slight risk of ending in a collision. This gives the adversary snake the opportunity to be faster in eating its own apples and to win. This illustrates the trade-off between safety and performance.
Intermediate Execution. To get more insights into the learned policies and the effects of shielding, we interrupt training at regular intervals to execute and evaluate the current policies. Every 10,000 steps of training, we execute the current policy for 1000 games without a shield. Figure 8 (left) illustrates the obtained rewards and Figure 8 (right) shows the number of safety violations (losses of the game due to a collision).
Since safety violations lead to a large negative reward, the values plotted on the left are negatively correlated with the values plotted on the right. It can be seen that policies created through unshielded learning clearly outperform the policies obtained via shielded learning and have fewer safety violations when the shield is removed during execution. As expected, the policies obtained via shielded learning without feedback about unsafe actions fail to learn safety constraints. We see the same when using a very permissive shield that blocks actions only if the minimal probability of staying safe is less than 50 percent. When using a very strict shield that blocks all actions with a safety valuation greater than 0.01 (the minimal probability of reaching unsafe states), we observe a significant decrease in safety violations for the informed learning setting. Note that executing the agents with a shield using λ = 0.01 would lead to zero safety violations.

Shielding a Reckless Agent
In our final set of experiments, we consider an agent called Agent 3, trained on Map 3 (Figure 9 (left)), with a reward structure that encourages the agent to cut off the enemy snake in order to win. In order to cut off the other snake, the RL agent has to get close to the adversary snake, bringing itself into risky situations, which makes it interesting for shielding. The reward structure of Agent 3 is as follows: the agent receives a reward of +10 for eating an apple, +100 for winning the game, -100 for losing, and -50 if the game ends in a tie. When using informed shields, the agent receives a reward of -100 from the shield for each blocked action.
Results. Training. Figure 9 (right) shows the rewards for Agent 3 during training. All graphs for Agent 3 use the same measurement values and metrics as for Agent 2. The results for Agent 3 are consistent with the results obtained for Agent 2. All shielded learning settings outperformed unshielded learning. We observed that all three learned policies obtained with a shield did not focus on eating apples while avoiding the adversary snake, but focused on winning by cutting off the other snake and causing it to crash into the avatar snake.
Intermediate Execution. Figure 10 (left) illustrates the obtained rewards and Figure 10 (right) shows the number of safety violations for Agent 3 during unshielded execution of policies. As before, every 10,000 steps of training, we execute the current policy for 1000 games without a shield.
We see even more significantly than before that the agent learns better to avoid safety violations when it is not shielded during training. Even in the informed case, where actions blocked by the shield are rewarded with -100, the agent does not understand the safety objective as well as in the unshielded case. The cause for this behaviour might be that the agent receives a negative update for actions that could lead to a safety violation instead of punishing actions that immediately lead to a safety violation. This might make it more difficult for the agent to understand the safety objective, as the action that potentially causes an issue is farther from the actual issue.
Execution. After 270,000 steps of training, we evaluated the obtained policies. We executed each policy for 1000 games without a shield and for 1000 games with a shield using λ = 0.01. In Table 1 we report the averaged reward and the total number of wins. Using the shield in the execution leads to similar results for all agents. Since we use a shield with λ = 0.01 during executions, the avatar snake only loses if the adversary snake manages to eat all apples and never due to a collision. Running the agents without a shield in the execution phase shows that the agent that was unshielded in training outperforms all agents that were shielded in training.

Conclusion and Future Work
In this paper, we propose an approach to prevent safety violations that can be avoided by planning ahead a short time into the future. Our online shielding exploits the time required to complete tasks to model and analyse the immediate future with respect to a safety property. For every decision at runtime, we create MDPs to model the current state of the environment and the behaviour of the agents. Given these MDP models, we employ probabilistic model-checking to evaluate every action possible in the next decision. In particular, we determine the probability of unsafe behaviours following every possible choice. This information is used to block unsafe actions, i.e., actions leading to safety violations with a probability exceeding a threshold relative to the minimal probability of safety violations. We evaluate online shielding in the context of RL by empirically analysing the effect of shielding on learning performance and the safety of learned policies. For this purpose, we proposed informed shields that update the learner's value function by penalising blocked actions, and we compare unshielded RL and RL augmented with uninformed and informed shields. Our experimental results show that shielding improves the performance of RL agents during learning. However, the final learned policies incur more safety violations than conventionally learned policies when executed in unshielded environments. Hence, to guarantee safety of control policies obtained through shielded (or unshielded) RL, shielding needs to be applied during execution in the field.
For future work, we plan to investigate the influence of imperfect information on shielding and shielded reinforcement learning. Online shielding is well-suited for agents using unreliable sensors, as it could counter sensor defects at runtime as they occur. We also plan to study the application of online shielding in other settings, such as decision making in robotics and control. Another interesting extension would be to incorporate quantitative performance measures in the form of rewards and costs into the computation of the online shield, as previously demonstrated in an offline manner [3] and in a hybrid approach [32], where runtime information was used to learn the environment dynamics.
To enhance readability, we use $\mathrm{pos}(s) = s[0] = v$, $\mathrm{task}(s) = s[1] = q$, and $\mathrm{turn}(s) = s[2] = t$ to access the elements of a state $s$. We additionally define $\mathrm{ava} = 0$, thus $\mathrm{pos}(s)[\mathrm{ava}]$ and $\mathrm{task}(s)[\mathrm{ava}]$ are the position and task of the avatar, whereas $\mathrm{turn}(s) = \mathrm{ava}$ specifies that it is the turn of the avatar. There is a unique action $\alpha_{adv}$ representing adversary decisions, there is a unique action $\alpha_e$ representing individual activities (movement along edges), and there are actions for each task available to the avatar, thus $A = \{\alpha_{adv}, \alpha_e\} \cup \mathrm{Task}(G)$.

Definition 2 (Decision State).
Given a safety-relevant MDP $M$, we define the set of decision states $S_D \subseteq S$ via $S_D = \{s_D \in S \mid \mathrm{task}(s_D)[\mathrm{ava}] = \epsilon \wedge \mathrm{turn}(s_D) = \mathrm{ava}\}$, i.e., it is the turn of the avatar and its task queue is empty.

Fig. 3: A screenshot of the Snake game using Map 1 with colour-coded shield display.

Fig. 4: Screenshots from the Snake game using a simple map to demonstrate recalculation.

Fig. 5: Shield computation time for varying horizon values and snake lengths.

Fig. 7: Results for Agent 2. Left: A screenshot of the game using Map 2; Right: Reward gained in training.

Fig. 9: Results for Agent 3. Left: A screenshot of the game using Map 3; Right: Reward gained in training.

Table 1: Execution Results for Agent 3.