Multi-Agent Patrolling under Uncertainty and Threats

We investigate a multi-agent patrolling problem where information is distributed alongside threats in environments with uncertainties. Specifically, the information and threat at each location are independently modelled as multi-state Markov chains, whose states are not observed until the location is visited by an agent. While agents will obtain information at a location, they may also suffer damage from the threat at that location. Therefore, the goal of the agents is to gather as much information as possible while mitigating the damage incurred. To address this challenge, we formulate the single-agent patrolling problem as a Partially Observable Markov Decision Process (POMDP) and propose a computationally efficient algorithm to solve this model. Building upon this, to compute patrols for multiple agents, the single-agent algorithm is extended for each agent with the aim of maximising its marginal contribution to the team. We empirically evaluate our algorithm on problems of multi-agent patrolling and show that it outperforms a baseline algorithm by up to 44% for 10 agents and by 21% for 15 agents in large domains.


Introduction
Unmanned Aerial Vehicles (UAVs) are increasingly becoming essential tools to carry out situational awareness tasks in a number of real-world applications ranging from disaster response [1][2][3] to security surveillance [3][4][5]. In these scenarios, multiple UAVs may be deployed to gather information at specific locations as quickly as possible in order to support an ongoing operation. However, such problems are often liable to a high degree of dynamism (e.g., fires may spread, wind direction may change) and uncertainty (e.g., it may not be possible to completely observe the causes of fires or the location of casualties may not be exactly known), and may also contain a number of hazards or threats for the UAVs (e.g., UAVs may fly close to buildings on fire or debris may fall on the UAVs).
In this paper, we consider the scenario where a set of UAVs aims to patrol the area to gather as much information as possible while minimising the negative impact of threats. Crucially, they aim to do so within an environment that is partially observable (i.e., the features of the locations are only fully observable where the UAV is located and partially observable at other locations). Hence, when planning the sequence of locations to visit, UAVs have the difficult task of estimating the information to be gained and threats to be encountered at these locations. This problem is compounded by the fact that the dynamism inherent to the environment may cause the information and threats at each location to change over time (i.e., the environment is stochastic). For example, when UAVs visit a building in a disaster area, the building states (intact, about to collapse, collapsing, or collapsed) may correspond to threat states (levels) for UAVs, and the threat at each location may change stochastically, such that it switches from "about to collapse" to "collapsed" due to an aftershock [6]. The information in the environment may also change dynamically (e.g., a victim may get out of danger or the fire may get close to a victim).
To date, a number of approaches to information gathering with teams of UAVs have been proposed. However, most of the work [3,7,8] focuses on developing algorithms for UAVs gathering information in dynamic environments where the model of the features of the environment is fully observable and stationary (see Related Work section for more details). Furthermore, none of these approaches has considered how threats may affect the information gathering process while the environment is partially observable and non-stationary. Unless such issues are tackled, we believe it is unlikely that large UAV deployments in real major disasters will be feasible.
In recent years, agent-based modelling has been effectively used to formulate and solve the problems of planning in environments characterized by uncertainties [9]. In agent-based models, an agent is an encapsulated computer system that is situated in some environment and that is capable of flexible, autonomous action in that environment in order to meet its design objectives [10]. Such agents are either software or hardware (e.g., robots or unmanned autonomous systems (UAS)). In particular, operating in uncertain environments, autonomous agents have to deal with executing actions that may not have the intended results, with environments that change while the agent is operating, and with making observations that might not be completely accurate.
Against this background, we propose an agent-based model for patrolling under uncertainty and threats and go on to develop a novel algorithm to solve the planning problem that it poses. In more detail, we first model the information and threats on a graph representing the environment, where the information and threat at each location are independently modelled as multi-state Markov chains (which captures the non-stationary feature), whose states are not observed until the location is visited by an agent (which captures the partially observable feature). Then, we cast the single-agent patrolling problem as a Partially Observable Markov Decision Process (POMDP), which provides a rich model for planning and acting in partially observable stochastic domains [11]. Unfortunately, existing POMDP solvers are very inefficient at solving our POMDP formulation due to the exponential growth of the number of possible paths of agents in the size of the graph and the number of the possible observations along each possible path (see Related Work section for more detail). Hence, we propose an online algorithm to solve the patrolling problem for one agent at a time. (In computer science, an online algorithm is one that can process its input piece-by-piece in a serial fashion, i.e., in the order that the input is fed to the algorithm, without having the entire input available from the start. In contrast, an offline algorithm is given the whole problem data from the beginning and is required to output an answer which solves the problem at hand.) In particular, the algorithm utilises a predictive heuristic that only refers to the possible paths (looking ahead several steps) from the current position of the agent. Building upon this, to compute patrolling policies for multiple agents, the single-agent algorithm is extended for each agent with the aim of maximising its marginal contribution to the team.
In summary, this paper advances the state of the art in the following ways:
• We propose the first algorithm for multi-agent patrolling under uncertainty and threats. Our formulation not only captures the partially observable and non-stationary features of the dynamic environment, but also accounts for the health status of the patrolling agents.
• We design a predictive heuristic to estimate the value of each possible path from the current position of the agent and provide an online algorithm to solve the patrolling problem for one agent at a time. Moreover, we propose a multi-agent algorithm that sequentially computes policies for individual agents. In particular, we also show that our multi-agent algorithm scales to larger environments (i.e., more than 10 agents) than existing solutions.
• We evaluate our algorithms in simulations and show that our algorithm outperforms a baseline algorithm by up to 44% for 10 agents and by 21% for 15 agents.
The remainder of this paper is structured as follows. First, we review the literature on patrolling problems. We then present our model for the problem of multi-agent patrolling under uncertainty and threats. Given this, we formulate the single-agent patrolling problem as a POMDP and provide an algorithm that computes policies for individual agents. Finally, we propose our multi-agent algorithm and evaluate it in the simulations of multi-agent patrolling in a large environment.

Related Work
In this section, we review related work on agent-based models and approaches for multi-agent patrolling problems.
In general, methods to gather situational awareness without considering threats are typically categorised as a class of information gathering problem [3], in which agents aim to continuously collect and provide up-to-date situational awareness. For these dynamic environments, previous work [3,7,8] considers fully observable (agents can directly observe the underlying state of the environment) stationary models (the joint probability distribution of the states does not change when shifted in time). A partially observable model has been proposed in [12], where an agent can only perceive the exact state at its current position. Game-theoretic approaches [13][14][15][16][17][18] have focused on patrolling to guard important targets in the presence of strategic evaders or intruders; a problem that is characterised by (possibly multiple) attackers attempting to avoid capture or breach a perimeter. The agents' main challenge in such cases is to detect and capture these attackers in an effort to minimise loss. However, these approaches do not consider the health status of the agents and the damage that agents can suffer while patrolling.
Stationary models of the information/threats are considered in previous work. The work on information gathering in dynamic environments [8] has focused on specific environmental phenomena (e.g., monitoring algal bloom growth in lakes and salt concentration in rivers) rather than stochastic events as in our scenarios. Markov models are widely used to model non-stationary stochastic states in the world, such as the specific ground targets for aircraft [12,19,20] and sensors [21], physical activities in wireless networks [22], and channel memory in communication systems [23,24]. However, a number of strict assumptions are made in these works in terms of the Markov models used. For example, each target at each period can be in one of only two states [12,23] and the matrix of the Markov models must have a special structure [24].
Among these works, a Markov Decision Process (MDP) based algorithm that computes policies for individual agents has been proposed in [3] to solve continuous information gathering in fully observable environments. Our formulation in this paper mainly extends [3] to patrolling under threats in partially observable and non-stationary environments and casts the single-agent patrolling problem as a POMDP. However, solving this formulation using current POMDP solvers [25] for all but the smallest instances is impossible due to the exponential growth in the number of possible paths of agents that can be traced in the environment and the number of the possible observations along each path. The POMCP algorithm has been proposed in [26] and has been shown to generate good solution quality and scale to large POMDPs. However, to the best of our knowledge, developing scalable approaches that extend POMCP to solve multi-agent POMDPs is still an open problem. As these possible benchmarks are unable to scale to multi-agent instances of our formulation, we design a baseline algorithm that greedily selects the policy for one time step as a benchmark.

Methods
In this section, we present the model for the problem of multi-agent patrolling under uncertainty and threats. Specifically, we first model the physical environment in which the agents operate and then go on to describe the decision problem faced by the agents.

The Patrolling Problem
We formulate the patrolling problem by defining the physical environment and patrolling agents. In particular, we present the Markov models of information and threat in the environment to capture the non-stationary feature.
The physical environment. The physical environment is defined by its spatial, temporal and dynamic properties. In particular, in the aftermath of a disaster, a number of specific sites might need urgent attention and access to these sites may be limited to certain areas (e.g., due to trees, debris, or natural obstacles). Hence, we can capture such features in terms of paths along which agents can travel from one disaster site to another. Specifically, the spatial property of the environment is encoded by a graph, which specifies how and where agents can move.
Definition 1 (Graph) We model an area of the environment as an undirected graph $G = (V, E)$, where the vertices $V$, representing spatial coordinates, are embedded in Euclidean space and the edges $E$ encode the movements that are possible between them. Here, we denote $N = |V|$.
In disaster response, each disaster site is a vertex in the graph, and a traversable area between a pair of sites is an edge of the graph.
Definition 2 (Time) Time is modelled by a set of time steps $\{1, 2, \ldots, T\}$ and at each time step $t \in \{1, 2, \ldots, T\}$ the agents visit some sites in the environment.
To capture the dynamic attributes of the environment, we assume that each vertex holds two states: one for information and one for threats.
Definition 3 (Information State Variable) An information state variable indicates different levels of the information at a given vertex.
For example, the number of people who need help and the status of a bridge are information state variables in a disaster response scenario.
Definition 4 (Threat State Variable) A threat state variable reflects the level of damage an agent suffers when visiting a given vertex.
For example, the level of fire and the degree of smog are typical threat state variables in disaster response.
Definition 5 (Markov Model of Information and Threat) The two state variables at each vertex change over time according to discrete-time multi-state Markov chains.
To capture the transitions of the state variables, we employ a Markov chain model. Specifically, for a Markov chain with $K$ states $S = (S_1, S_2, \ldots, S_K)$, the matrix of transition probabilities for pairs of states is defined as:
$$P = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1K} \\ p_{21} & p_{22} & \cdots & p_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ p_{K1} & p_{K2} & \cdots & p_{KK} \end{pmatrix}$$
where $p_{ij}$ is the probability that state $S_i$ transitions to $S_j$ in one time step and $S_i, S_j \in S$. An example of information and threat models at a vertex is shown in Fig 1. Fig 1(a) shows a threat model with 2 states (i.e., $R_1$ and $R_2$) and Fig 1(b) shows an information model with 3 states (i.e., $I_1$, $I_2$ and $I_3$), where the probability that each information/threat state changes to another over a time step is given (e.g., the probability that $R_1$ changes to $R_2$ is 0.1).
The set of information states $I^n = \{I^n_1, I^n_2, \ldots, I^n_{K^n_I}\}$ for location $v_n$ corresponds to the $K^n_I$ amounts of information that agents may obtain when visiting $v_n$. The value of information is determined by the function $f^n : I^n \to \mathbb{R}^+$, and $f^n(I^n_k)$ increases monotonically with $k \in \{1, \ldots, K^n_I\}$, which indicates that the states of information are ordered in terms of their value. The information state at a given vertex independently evolves as a $K^n_I$-state Markov chain with a matrix of transition probabilities $P^n_I$. Similarly, the set of threat states $R^n = \{R^n_1, R^n_2, \ldots, R^n_{K^n_R}\}$ indicates the $K^n_R$ threat levels of vertex $v_n \in V$. The "damage" that an agent suffers when visiting vertex $v_n$ is captured by the function $c^n : R^n \to \mathbb{R}^+$, and $c^n(R^n_k)$ increases monotonically with $k \in \{1, \ldots, K^n_R\}$. The threat state at a given vertex independently evolves over time as a $K^n_R$-state Markov chain with a matrix of transition probabilities $P^n_R$. Having modelled the environment in which the agents operate, we next elaborate on the agents' goals.
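The Markov model above can be illustrated with a short sketch. It is only a minimal illustration under stated assumptions: the helper names (`is_row_stochastic`, `is_nondecreasing`, `step_state`) are hypothetical, states are indexed from 0, and every matrix entry is made up except the probability 0.1 that $R_1$ changes to $R_2$, which is the value quoted for Fig 1. The value vector [0, 2, 5] is one of the paper's own examples for a 3-state information model.

```python
import random

def is_row_stochastic(P, tol=1e-9):
    """Return True if every row of P is a probability distribution."""
    return all(
        all(p >= 0.0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in P
    )

def is_nondecreasing(values):
    """Return True if the value vector is ordered by state index."""
    return all(a <= b for a, b in zip(values, values[1:]))

def step_state(state, P, rng):
    """Sample the next state index (0-based) from row `state` of P."""
    u = rng.random()
    cumulative = 0.0
    for nxt, p in enumerate(P[state]):
        cumulative += p
        if u < cumulative:
            return nxt
    return len(P[state]) - 1  # guard against floating-point round-off

# Threat chain with 2 states: p_12 = 0.1 is the value quoted for Fig 1;
# the second row is a hypothetical choice.
P_R = [[0.9, 0.1],
       [0.3, 0.7]]
# Information chain with 3 states: all entries are hypothetical.
P_I = [[0.6, 0.3, 0.1],
       [0.2, 0.5, 0.3],
       [0.1, 0.3, 0.6]]
F = [0, 2, 5]  # example information value vector for a 3-state vertex

if __name__ == "__main__":
    rng = random.Random(0)
    state, trajectory = 0, [0]
    for _ in range(10):
        state = step_state(state, P_R, rng)
        trajectory.append(state)
    print("threat trajectory:", trajectory)
```

Seeding the generator keeps a simulated trajectory reproducible, which is convenient when comparing patrolling policies on the same realisation of the environment.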
Patrolling Agents. We define a patrolling agent (agent for short) as a physical mobile entity situated in the environment defined above that is capable of gathering information and may be damaged by the threat when visiting a vertex. The set of all agents is denoted as $A = \{1, \ldots, |A|\}$. The movement and visit capabilities of agents are formulated as follows. When patrolling in a graph $G$, each agent is positioned at a given vertex in $G$ at each time step $t$. The movement of each agent is atomic, i.e., it takes place within the interval between two subsequent time steps, and is constrained by $G$, i.e., agent $m$ positioned at a vertex $v_i \in V$ can only move to a vertex $v'_i \in adj_G(v_i)$ that is adjacent to $v_i$ in $G$. We assume that $\forall v_i \in V$, $v_i \in adj_G(v_i)$, i.e., an agent can also stay at the same vertex. The speed of the agents is sufficient to reach an adjacent vertex within a single time step. Time can be discretised according to the speed of the UAVs. Thus, if the UAVs can travel between sites in five minutes, then a time step may be set at 5 minutes in the model. Given this, an agent visits a vertex $v_n$ when it is positioned at that vertex. On the one hand, a visit results in the agent being aware of the current information and threat states at $v_n$, such as $I^n_i$ and $R^n_j$ respectively. On the other hand, this agent obtains a reward $f^n(I^n_i)$ and suffers a loss $c^n(R^n_j)$ for a visit. The time it takes to visit a vertex is assumed to be negligible. We let $F^n = [f^n(I^n_1), \ldots, f^n(I^n_{K^n_I})]$ denote the information value vector, where $f^n(I^n_k)$ is the information value that an agent obtains if the information state is $I^n_k$ (e.g., information at a vertex has 3 states and corresponds to 3 information values [0, 2, 5]). Similarly, we let $C^n = [c^n(R^n_1), \ldots, c^n(R^n_{K^n_R})]$ denote the damage value vector at vertex $v_n$, where $c^n(R^n_k)$ is the damage that an agent suffers if the threat state is $R^n_k$ (e.g., the fire level at a position has 4 states, which correspond to 4 levels of damage [0, 4, 6, 10], and the smog degree at a position has 3 states, which correspond to 3 levels of damage [0, 2, 5]). For each visit, the information at that vertex is obtained by the agent and we assume that the information state at a given vertex $v_n$ resets to $I^n_1$ when an agent visits this vertex ($I^n_1$ is the information state which means that no new information has been generated at $v_n$ since the last visit).
Fig 1. Information and threat models at a vertex, where the probability that each information/threat state changes to another over a time step is given (e.g., the probability that $R_1$ changes to $R_2$ is 0.1). doi:10.1371/journal.pone.0130154.g001
As the states at each vertex change over time and agents can only access the exact states at the vertices that they visit, the patrolling environment can be considered non-stationary (i.e., the joint probability distribution of its states may change when shifted in time) and partially observable.
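The movement and visit rules described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names and the three-vertex line graph are hypothetical, states are indexed from 0 (so index 0 stands for $I^n_1$), and the value vectors [0, 2, 5] and [0, 4, 6, 10] are the examples given in the text.

```python
def neighbours(adj, v):
    """Vertices reachable from v in one time step, including v itself
    (an agent may stay at the same vertex)."""
    return set(adj[v]) | {v}

def move(adj, position, target):
    """Return the new position, rejecting moves that the graph forbids."""
    if target not in neighbours(adj, position):
        raise ValueError(f"{target!r} is not adjacent to {position!r}")
    return target

def visit(info_state, threat_state, F_n, C_n):
    """Visit a vertex whose true states are the given 0-based indices.

    Returns (reward, damage, new_info_state): the agent collects the
    information value, suffers the damage, and the information state of
    the vertex resets to its first state (index 0).
    """
    reward = F_n[info_state]
    damage = C_n[threat_state]
    return reward, damage, 0

# Hypothetical line graph a - b - c.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
F_n = [0, 2, 5]      # information value vector from the text's example
C_n = [0, 4, 6, 10]  # damage value vector from the text's fire example

if __name__ == "__main__":
    position = move(adj, "a", "b")
    print(position, visit(2, 1, F_n, C_n))
```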
Furthermore, in this paper, we make two assumptions about the communication and cooperation among agents as follows.
Assumption 1 All the agents can share their collected observations with each other via communication. Such peer to peer communication is free of noise, costs, and delays.
For example, consider a centralised station that coordinates a team of UAVs monitoring the continuously changing state of a disaster area, where each UAV has full communication with this station. Assumption 1 is satisfied in such domains. However, in some real scenarios, UAVs can only coordinate with each other using limited communication and decentralised approaches may be more appropriate (but this is beyond the scope of this paper and will be considered in future work).
Assumption 2 When more than one agent is visiting a vertex, only one information value is obtained for the team but each agent suffers the same damage that may be generated at that vertex.
This assumption is satisfied in the scenarios where the information gathering capability of one agent at a vertex is equal to that provided by multiple agents with the same sensors, and agents independently suffer the damage caused by threats. In future work, a model of information fusion for multiple (heterogeneous) agents will be considered. Thus, the team of agents needs to coordinate based on their observations while patrolling. Specifically, the goal of the agents is to gather as much information as possible while minimising the damage incurred.
We now provide a simple example to explain how the agents would operate in this scenario. Consider an agent that enters a building on fire. In our setting, this is equivalent to the agent visiting a node in the graph. The fire level (threat state variable) and valuable information about victims and assets (information state variable) change over time. While exploring the building, the agent may acquire some information and suffer some damage due to the fire. At each time step, an agent selects one adjacent building to visit based on the estimated information value and the prior observation of threat states at each location. It then obtains a reward based on the value of the information, and suffers a loss which is associated with the threat state. Then, the information state at the visited vertex is immediately reset.
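Assumption 2 can be made concrete with a small sketch of one team time step. The function name `team_step` and the data layout (dictionaries keyed by vertex, 0-based state indices) are hypothetical choices for illustration only; the value vectors are the examples given earlier in the text.

```python
def team_step(positions, info_state, threat_state, F, C):
    """Evaluate one time step for a team of agents.

    positions    : list with the vertex visited by each agent
    info_state   : dict vertex -> current information state index
    threat_state : dict vertex -> current threat state index
    F, C         : dict vertex -> information / damage value vector

    Returns (team_information, damages, new_info_state). Information at
    a vertex is counted once however many agents visit it, whereas every
    visiting agent suffers the damage of the vertex it is on.
    """
    visited = set(positions)
    team_information = sum(F[v][info_state[v]] for v in visited)
    damages = [C[v][threat_state[v]] for v in positions]
    # Visited vertices reset to the first information state (index 0).
    new_info_state = {v: (0 if v in visited else s)
                      for v, s in info_state.items()}
    return team_information, damages, new_info_state

if __name__ == "__main__":
    F = {v: [0, 2, 5] for v in "abc"}
    C = {v: [0, 4, 6, 10] for v in "abc"}
    print(team_step(["a", "a", "b"],
                    {"a": 2, "b": 1, "c": 2},
                    {"a": 1, "b": 0, "c": 3}, F, C))
```

In the usage example two agents share vertex a, so its information value 5 is collected once while both agents suffer damage 4.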
Having defined the patrolling problem, we now need to plan the sequential patrolling actions for agents based on the history of actions and observations, and the model of the environment. Hence, in what follows, we first propose a POMDP formulation for single-agent patrolling within a graph and design an algorithm to solve it. Building upon this, we propose a scalable multi-agent patrolling algorithm.

Single-Agent Patrolling
In this section, we first formulate the POMDP-based framework for the single-agent patrolling problem. In a POMDP, the agent does not know the exact state it is in, and needs to keep track of each observation received in order to maintain a probability distribution, known as the belief state, over the possible states [11]. We then show that a standard representation of the belief state makes the POMDP computationally intractable and present a compact representation of the belief state for our POMDP formulation. Given this, we propose a predictive heuristic and an online single-agent algorithm.
The POMDP Framework. We now set up the single-agent patrolling problem as a POMDP $\langle S, A, O, T, O, r \rangle$ as follows:
• $S$ is the set of states. A state is defined as a tuple $s = [v, (s^1_R, \ldots, s^N_R), (s^1_I, \ldots, s^N_I)] \in S$, where $v$ is the current position of the agent, and $s^n_R \in R^n$ and $s^n_I \in I^n$ are the threat and information states at vertex $v_n \in V$. We denote by $s_e = [(s^1_R, \ldots, s^N_R), (s^1_I, \ldots, s^N_I)] \in S_e$ the state that captures the information and threat states at all of the positions. Given this, the number of states in $S$ increases exponentially with the number of vertices.
• $A$ is the set of all actions. The agent selects an adjacent vertex to visit as an action.
• $O$ is the set of observations. We define an observation $o = (v_i, s^i_I, s^i_R) \in O$ as the current position $v_i$ and the information and threat states at this position.
• $T$ is the set of conditional transition probabilities. We assume that $v$ is deterministic and only determined by the destination of the movement of the agent. Based on the Markov models defined in the patrolling problem, $s_e$ follows a discrete-time Markov process with $\prod_{n=1}^{N} K^n_R K^n_I$ states.
• $r$ is the reward function, where $\alpha$ is a weight parameter of the two objectives (information gathered and damage suffered).
The objective of the agent is then to choose the movement actions sequentially to maximize the total expected reward accumulated over T steps.
In this model, the states are not directly observable. Hence, a standard belief vector $B(t) = [b_1(t), \ldots, b_M(t)]$ is defined as the posterior probability distribution over the possible states $S$, where $b_m(t)$ is the conditional probability that the environment is in the $m$th state at the current time step $t$. For any $t$, it has been shown in [27] that this belief vector is a sufficient statistic for the design of the optimal action for each time step. A policy $\pi$ specifies the action that will be executed in any given belief state, and the optimal policy $\pi^*$ is a policy by which the agent gets the maximum total expected reward accumulated over $T$ steps. However, as each environment state is a joint state of the information and threat states at all of the vertices, the number of possible states defined in our POMDP is $\prod_{n=1}^{N} K^n_R K^n_I$, which increases exponentially with the number of vertices. Moreover, as the belief vector is defined as the posterior probability distribution over these possible states, the dimension of this belief vector also increases exponentially with the number of vertices.
To address this, we propose an online method by introducing a belief vector of reduced dimension and develop a predictive heuristic to reduce the search space and still produce high quality solutions (as we show later).
Compact Belief Representation. As the threat state and information state variables at each vertex evolve independently and v is deterministic, we can find a sufficient statistic for the optimal policy whose dimension linearly grows with N, similar to [23,24]. We introduce a compact representation of belief state and its transition function in this section.
We define a sufficient statistic belief vector of the environment states at time $t$ as the vector of the conditional probabilities (conditioned on the observation and decision history) $C(t) = [C_R(t), C_I(t)]$, where $C_R(t)$ is defined as:
$$C_R(t) = [w^1_R(t), \ldots, w^N_R(t)], \qquad w^n_R(t) = [w^n_{R1}(t), \ldots, w^n_{RK^n_R}(t)],$$
where $w^n_{Rk_1}(t)$ is the probability that the threat state at vertex $v_n$ is $R^n_{k_1}$, $k_1 = 1, \ldots, K^n_R$, and $C_I(t)$ is defined as:
$$C_I(t) = [w^1_I(t), \ldots, w^N_I(t)], \qquad w^n_I(t) = [w^n_{I1}(t), \ldots, w^n_{IK^n_I}(t)],$$
where $w^n_{Ik_2}(t)$ is the probability that the information state at vertex $v_n$ is $I^n_{k_2}$ and $k_2 = 1, \ldots, K^n_I$. Then $C(t)$ is a sufficient statistic for optimal decision making [23,24]. By exploiting the statistical independence among vertices, we reduce the dimension of the sufficient statistic from $\prod_{n=1}^{N} K^n_R K^n_I$ to $\sum_{n=1}^{N} (K^n_R + K^n_I)$, which grows linearly with $N$. This reduces the computational and storage complexity of representing and updating the belief state from exponential to linear in $N$.
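The size of the reduction is easy to check numerically. The sketch below simply evaluates the two expressions for the belief dimension; the function names are hypothetical, and the example of 10 vertices with 4 threat states and 3 information states each is an illustrative choice, not a configuration taken from the paper.

```python
from math import prod

def joint_belief_dim(K_R, K_I):
    """Dimension of the standard belief vector: product over vertices of
    (number of threat states) * (number of information states)."""
    return prod(kr * ki for kr, ki in zip(K_R, K_I))

def compact_belief_dim(K_R, K_I):
    """Dimension of the compact belief vector: sum over vertices of
    (number of threat states) + (number of information states)."""
    return sum(kr + ki for kr, ki in zip(K_R, K_I))

if __name__ == "__main__":
    N = 10                       # illustrative number of vertices
    K_R, K_I = [4] * N, [3] * N  # illustrative state counts per vertex
    print(joint_belief_dim(K_R, K_I), compact_belief_dim(K_R, K_I))
```

For this configuration the standard belief vector has 12^10 entries (about 6.2 × 10^10), whereas the compact one has 70.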
Theorem 1 For any time t, C(t) is a sufficient statistic for the design of optimal policy for our POMDP formulation.
Proof We show that when the information and threat at the $N$ vertices evolve independently, each element $b_m(t)$ in the standard belief vector $B(t)$ can be obtained from $C(t)$, where $b_m(t)$ is the conditional probability that the environment is in the $m$th state. Without loss of generality, we consider $N = 2$. Let $I(t)$ denote the history up to the beginning of slot $t$. Let $\tau_n$ denote the most recent time instant when vertex $v_n$ is visited. We can thus write an entry $b_m(t)$ as in Eq (4). Quantities in Eq (4) are entries of $C(t)$. Hence, $C(t)$ is a sufficient statistic.
Initially, we assume that we have probabilistic information about the state of each of the $N$ vertices, $C(0) = [C_R(0), C_I(0)]$. Then, the elements of the belief vector $C(t)$ are updated to $C(t+1)$ upon action $a = v_i$ and observation $o = (v_i, s^i_I, s^i_R)$. For the threat belief, the update is:
$$w^n_R(t+1) = \begin{cases} \tilde{I}_k, & \text{if } v_n \text{ is visited and } R^n_k \text{ is observed}, \\ w^n_R(t) P^n_R, & \text{otherwise}, \end{cases} \quad (5)$$
where $\forall v_n \in V$, $R^n_k \in R^n$, $I^n_k \in I^n$, $\tilde{I}_k$ is a unit vector whose $k$th item is 1, and $P^n_R$ and $P^n_I$ are respectively the matrices of transition probabilities of threat and information at position $v_n$. As shown in Eq (5), the threat belief vector $w^n_R(t+1)$ at a vertex $v_n$ that some agent is visiting is updated to $\tilde{I}_k$ based on the observation $R^n_k$ at this vertex, while $w^n_R(t+1)$ at any other vertex that no agent is visiting is updated from the current threat belief vector $w^n_R(t)$ and the threat Markov model $P^n_R$ at this vertex. A similar explanation holds for the update to the information belief $w^n_I(t+1)$ for $v_n \in V$.
Based on the transition function above, a policy $\pi$ specifies a sequence of actions $\pi = [\pi(1), \pi(2), \ldots]$, where $\pi(t)$ is the position selected to visit at time $t$. Given this, the optimal policy can be computed as:
$$\pi^* = \arg\max_{\pi} E\left[\sum_{t=1}^{T} \gamma^{t} R_{\pi(t)}(C(t))\right]$$
where $R_{\pi(t)}(C(t))$ is the reward obtained when the belief state is $C(t)$ and $\gamma \in [0, 1]$ is the discount factor. Although the dimensionality of the belief state is reduced, the problem is still a POMDP and finding the optimal solution is intractable. Based on this reduced belief vector, we next develop a predictive heuristic and present the online single-agent algorithm that implements this heuristic.
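The per-vertex belief update described above can be sketched in a few lines. This is a hedged illustration, not the authors' code: the function names and the dictionary layout are hypothetical, states are indexed from 0, and the treatment of the information belief at a visited vertex (collapsing to the first state because the information resets on a visit) is an assumption made here for the sketch.

```python
def propagate(w, P):
    """One-step prediction w P for a row belief vector w."""
    return [sum(w[i] * P[i][j] for i in range(len(w)))
            for j in range(len(P[0]))]

def unit(k, size):
    """Unit vector whose k-th item (0-based) is 1."""
    return [1.0 if i == k else 0.0 for i in range(size)]

def update_threat(w_R, P_R, visited=None, observed=None):
    """Threat beliefs at t+1: the visited vertex collapses onto the
    observed state, every other vertex is propagated through its chain."""
    return {
        n: unit(observed, len(w)) if n == visited else propagate(w, P_R[n])
        for n, w in w_R.items()
    }

def update_information(w_I, P_I, visited=None):
    """Information beliefs at t+1. Assumption of this sketch: the visited
    vertex resets to the first information state, i.e. (1, 0, ..., 0)."""
    return {
        n: unit(0, len(w)) if n == visited else propagate(w, P_I[n])
        for n, w in w_I.items()
    }

if __name__ == "__main__":
    P = [[0.9, 0.1], [0.3, 0.7]]  # hypothetical 2-state chain
    w = {"a": [0.5, 0.5], "b": [0.5, 0.5]}
    print(update_threat(w, {"a": P, "b": P}, visited="a", observed=1))
```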
The Predictive Heuristic. In order to develop a predictive heuristic for online policy selection, we first introduce the assumption that the Markov state transition matrices are monotone matrices, which means that the higher the information/damage value of the vertex's current state, the higher the likelihood that the next state of this vertex will be of high information/damage value. Then, we show how to define the predictive heuristic as the predictive expected future reward based on the monotonicity of the transition matrices.
Stochastic dominance is a central theme in a wide variety of applications in economics, finance and statistics [28]. A similar assumption has been made to model the states of the channels in communication systems [23,24] and the states at targets for UAV monitoring [12].
Stochastic dominance $\succeq$ between two $Z$-dimensional probability vectors $x$, $y$ is defined as $x \succeq y$ if:
$$\sum_{j=i}^{Z} x_j \ge \sum_{j=i}^{Z} y_j, \quad \forall i \in \{1, \ldots, Z\}. \quad (7)$$
We assume that the Markov information model and Markov threat model are monotone matrices, i.e., the rows of the matrices of transition probabilities $P^n_R$ and $P^n_I$ satisfy:
$$P^n_{RK^n_R} \succeq \cdots \succeq P^n_{R2} \succeq P^n_{R1}, \qquad P^n_{IK^n_I} \succeq \cdots \succeq P^n_{I2} \succeq P^n_{I1}, \quad (8)$$
where $P^n_{Rk}$ and $P^n_{Ik}$ denote the $k$th rows of $P^n_R$ and $P^n_I$. If the matrices of transition probabilities $P^n_R$ and $P^n_I$ satisfy the assumption above, then $P^n_R$ and $P^n_I$ are monotone matrices [29]. Under this assumption, the higher the information value of the state of the current vertex, the higher the likelihood that the next state of this vertex will be of high information value, i.e., if $w^n_I(t) \succeq w^{n'}_I(t)$, then $w^n_I(t) P^n_I \succeq w^{n'}_I(t) P^{n'}_I$. From Eq (5), we know that the probability vectors for the information states of two vertices keep the relationship of stochastic dominance when no agent visits either of them. Obviously, if $w^n_I(t) \succeq w^{n'}_I(t)$, then $w^n_I(t) F^n \ge w^{n'}_I(t) F^{n'}$, which means that a stochastically dominant information belief vector is likely to have a higher information value. Likewise, a stochastically dominant threat belief vector is likely to have a higher damage value. In particular, as the information state at a given vertex resets to $I_1$ when an agent visits this vertex, the belief vector of information states $(1, 0, \ldots, 0)$ is stochastically dominated by the belief vector of any vertex which is not being visited, so the more recently visited vertex always has a lower expected information value.
Note that our monotonicity assumption is not a constraint that makes the information value (or the damage of the threat) increase with time, but a requirement that the row vectors of the information (or threat) transition matrices are ordered by stochastic dominance. We now provide an example of a 4-state Markov threat model at a vertex as follows: It can be seen that the row vectors of $P_R$ satisfy the condition of Eq (8), i.e., $P_{R4} \succeq P_{R3} \succeq P_{R2} \succeq P_{R1}$, where, taking $P_{R3} \succeq P_{R2}$ as an example, the elements of $P_{R2}$ and $P_{R3}$ match the condition for stochastic dominance of Eq (7). For example, if the threat beliefs at vertices $v_1$ and $v_2$ are respectively $w^1_R = [0.1, 0.2, 0.5, 0.2]$ and $w^2_R = [0.2, 0.4, 0.3, 0.1]$, then $w^1_R \succeq w^2_R$ and $v_1$ is likely to have a higher next threat state than $v_2$. However, after a time step, it is possible that any threat state may switch not only to a higher state, but also to a lower state.
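The dominance and monotonicity conditions can be checked mechanically. The sketch below assumes the usual first-order (tail-sum) definition of stochastic dominance and calls a matrix monotone when each row dominates the row before it. The function names and the 4-state matrix `P_R` are hypothetical illustrations; only the two belief vectors are taken from the example in the text.

```python
def dominates(x, y, tol=1e-12):
    """Return True if x stochastically dominates y, i.e. every tail sum
    of x is at least the corresponding tail sum of y."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    tail_x = tail_y = 0.0
    for xj, yj in zip(reversed(x), reversed(y)):
        tail_x += xj
        tail_y += yj
        if tail_x < tail_y - tol:
            return False
    return True

def is_monotone(P):
    """Return True if every row of P dominates the previous row."""
    return all(dominates(P[k + 1], P[k]) for k in range(len(P) - 1))

# Belief vectors from the example in the text.
w1 = [0.1, 0.2, 0.5, 0.2]
w2 = [0.2, 0.4, 0.3, 0.1]

# Hypothetical 4-state threat matrix chosen so that its rows are ordered.
P_R = [[0.7, 0.2, 0.1, 0.0],
       [0.4, 0.3, 0.2, 0.1],
       [0.2, 0.3, 0.3, 0.2],
       [0.1, 0.2, 0.3, 0.4]]

if __name__ == "__main__":
    print(dominates(w1, w2), dominates(w2, w1), is_monotone(P_R))
```

The small tolerance guards against floating-point round-off when two tail sums are mathematically equal.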
Given the monotonicity assumption, we can use the relationship between the belief states at different vertices in order to "predict" the belief state at an unvisited node. Hence, we can estimate the expected reward agents may get from one vertex of the graph when visiting it at a near future step. We denote a feasible policy of length $D$ at time $t$ as $\pi_D(t) = (\pi_{t+1}, \ldots, \pi_{t+D})$, which consists of $D$ consecutive deterministic vertices/actions.
Here, we define the predictive heuristic as the predictive expected future reward $E[R(\pi_D(t))]$ for policy $\pi_D(t)$, which is the aggregate of the expected reward of each step in $\pi_D(t)$, computed from the predictive belief vector $[\hat{w}^{\pi_{t+i}}_I(t+i), \hat{w}^{\pi_{t+i}}_R(t+i)]$ at the vertex $\pi_{t+i}$ and time $t+i$. For the step $t+1$, we can get the predictive belief vector $[\hat{w}^{\pi_{t+1}}_I(t+1), \hat{w}^{\pi_{t+1}}_R(t+1)]$ from the current belief vector $C(t)$, current action $a(t)$ and observations $\theta(t)$, i.e., $C(t+1) = \delta(C(t) \mid a^*_t, \theta(t))$, which is the belief vector at $t+1$ obtained from Eq (5). For $\{t+2, \ldots, t+D\}$, we get the predicted belief vector from a transition that omits the observations in Eq (5), as given by Eq (10).
Given the predictive heuristic and policies that look ahead $D$ time periods, the agent compares all feasible paths of length $D$ and chooses the next location to visit according to the path that gives the highest predictive expected reward gained over that path. The details of how to use the heuristic in our online single-agent algorithm are presented in the next section.
The Online Algorithm. Based on the predictive heuristic, we propose an online algorithm for the single-agent patrolling problem (Algorithm 1) in this section.
Step 1.3: Compare $\pi_D(t)$ with the stored best policy:
9: if $E[R(\pi_D(t))] > E[R(\pi^*_D(t))]$ then
10: $\pi^*_D(t) \leftarrow \pi_D(t)$
11: end if
12: end for
Step 2: return the next action from the best policy $\pi^*$
First, we compute $P_D(t)$, which is the set of all feasible policies that start from the current position $v(t)$ (step 0), where we call the parameter D the maximum horizon, i.e., the number of steps we look ahead in the POMDP. Then, we compute the predictive expected reward for all the policies. For each policy, the belief state at $t+1$ is updated from the belief state, position and observations at t by Eq (5) (line 2), and the predictive belief state at $\{t+2, \ldots, t+D\}$ is computed by Eq (10) (lines 3-7). Given this, we compute the predictive reward for the policy (line 8). Thus, the best policy is $\pi^*_D(t) = \arg\max_{\pi_D(t) \in P_D(t)} E[R(\pi_D(t))]$. The best next action is then $a^*(t+1) = \pi^*_{t+1}$, which is the first action of the best policy (line 13).
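The look-ahead search described above can be sketched as follows. This is a minimal illustration rather than the paper's Algorithm 1: the graph, the `toy_reward` function standing in for the predictive expected reward, and all names are assumptions, and the sketch assumes every vertex has at least one neighbour.

```python
def feasible_paths(adjacency, start, depth):
    """Yield every path of `depth` moves leaving `start` (start itself excluded)."""
    if depth == 0:
        yield ()
        return
    for nxt in adjacency[start]:
        for rest in feasible_paths(adjacency, nxt, depth - 1):
            yield (nxt,) + rest


def next_action(adjacency, position, depth, expected_reward):
    """Return the first vertex of the path with the highest predicted reward."""
    best_path, best_value = None, float("-inf")
    for path in feasible_paths(adjacency, position, depth):
        value = expected_reward(path)
        if value > best_value:  # replace the stored best policy
            best_path, best_value = path, value
    return best_path[0]


# Hypothetical 4-vertex graph and per-vertex values, for illustration only.
adjacency = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
value = {0: 0.0, 1: 1.0, 2: 2.0, 3: 5.0}


def toy_reward(path):
    # Stand-in for E[R(pi_D(t))]: each distinct vertex on the path counts once.
    return sum(value[v] for v in set(path))


print(next_action(adjacency, 0, 2, toy_reward))  # 2, via the path (2, 3)
```

The number of enumerated paths grows exponentially with `depth`, so the horizon is the natural knob for trading solution quality against computation time.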
Having defined the online single-agent algorithm for our formulation of patrolling under uncertainty and threats, we extend it to compute policies for multi-agent problems next.

Multi-Agent Patrolling
For multi-agent patrolling, we assume that all agents can share their collected observations with each other through full communication. Thus, the team of agents may not only obtain more information about the environment, but each agent may also make decisions using the observations shared by the other agents. Given this, we formulate the multi-agent patrolling problem as a Multiagent POMDP (MPOMDP) and design a scalable online multi-agent algorithm to coordinate the actions of agents in their patrolling tasks.
An MPOMDP with complete communication can be reduced to a POMDP with a single centralised controller that takes joint actions and receives joint observations [30]. We now set up our problem of multi-agent patrolling in a graph as a POMDP $\langle M, S, A, \mathcal{O}, T, O, r \rangle$ as follows.
• M is the set of the agents.
• S is the set of states. A state is defined as a tuple $\tilde{s} = [\tilde{v}, (s_R^1, \ldots, s_R^N), (s_I^1, \ldots, s_I^N)] \in S$, where $\tilde{v}$ denotes the current positions of the agents, and $s_R^n \in R_n$ and $s_I^n \in I_n$ are the threat and information states at vertex $v_n \in V$. We denote by $\tilde{s}_e = [(s_R^1, \ldots, s_R^N), (s_I^1, \ldots, s_I^N)] \in S_e$ the state that captures the information and threat states at each position.
• A is the set of all joint actions. The agents select adjacent vertices to visit as a joint action.
• $\mathcal{O}$ is the set of joint observations. For the current positions of the agents and the information and threat states at those positions, we define a joint observation $\tilde{o} = \{\tilde{v}, \{o_i \mid \forall v_i \in \tilde{v}\}\} \in \mathcal{O}$, where $o_i = (s_R^i, s_I^i)$ is the observation of agent i.
• T is the set of conditional transition probabilities. We assume that $\tilde{v}$ is deterministic and only determined by the destinations of the joint movement of agents. $\tilde{s}_e$ follows a discrete-time Markov process with $\prod_{n=1}^{N} K_R^n K_I^n$ states.
• O is the set of observation probabilities. As an observation $\tilde{o}$ is directly a part of some state, the observation probability is $O(\tilde{o} \mid s', \tilde{a}) = 1$ if $\tilde{o}$ is consistent with the corresponding part of $s'$, and $O(\tilde{o} \mid s', \tilde{a}) = 0$ otherwise.
• $r: A \times \mathcal{O} \to \mathbb{R}$ is the reward function. $r(\tilde{a}, \tilde{o})$ is the sum of the rewards obtained by the agents, associated with the joint action $\tilde{a}$ and observation $\tilde{o}$, where $n_{v_i}$ is the number of agents visiting $v_i$.
The objective of the agents is then to choose their movement actions sequentially so as to maximise the total expected reward accumulated over T steps. We note that, while the state variables described in Eqs (2) and (3) can be used to express the belief vector of the environment states for a multi-agent POMDP, the joint action and joint observation spaces of the POMDP are the Cartesian products of the action and observation spaces of the individual agents. Consequently, the sizes of the joint action space and joint observation space grow exponentially with the number of agents $|M|$, allowing only the smallest problem instances to be solved. Instead, sequentially computing policies for individual agents, as in our multi-agent algorithm, avoids computing a joint policy for the team, at the expense of solution quality. Nevertheless, a bound on the solution quality of this multi-agent algorithm is guaranteed (we analyse this later).
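To give a feel for the growth of the joint action space, the sketch below compares the number of one-step joint actions with the number of individual actions examined when agents choose one after another. The figures (10 agents with 3 adjacent vertices each) and the function names are hypothetical.

```python
from math import prod


def joint_action_count(neighbour_counts):
    """Size of the joint action space: product of per-agent action counts."""
    return prod(neighbour_counts)


def sequential_action_count(neighbour_counts):
    """Actions examined when the agents choose one after another."""
    return sum(neighbour_counts)


degrees = [3] * 10  # hypothetical: 10 agents, 3 adjacent vertices each

print(joint_action_count(degrees))       # 59049
print(sequential_action_count(degrees))  # 30
```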
Similar methods have been successfully used to solve multi-agent problems [3,8]. As these formulations differ from our partially observable scenarios, a straightforward application of their methods is not possible. Hence, we consider how to sequentially compute policies for individual agents in partially observable problems using our online single-agent algorithm.
When sequentially computing policies for individual agents using our predictive heuristic, there implicitly exists an order in which the agents choose their actions: agent 1 fixes the D-step actions of its best policy first, agent 2 second, and so on. The expected future reward of a policy $\pi_D^i(t)$ of agent i is conditioned on the position $v_i(t)$, the belief vector $C(t)$ and the best policies previously computed for agents $M_{-i} = \{1, \ldots, i-1\}$.
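The ordering described above can be sketched as a loop in which each agent picks its best path given the paths already fixed by earlier agents. This is a simplified illustration, not the paper's algorithm: the candidate paths, the vertex values and the toy `marginal_score` (which ignores a vertex if an earlier agent occupies it at the same step) are all assumptions.

```python
def plan_sequentially(agents, candidates, score):
    """Fix one path per agent, in order, each conditioned on earlier choices.

    agents: ordered agent ids; candidates[a]: list of paths for agent a;
    score(path, earlier_paths): marginal value of `path` for the team.
    """
    chosen = []
    for agent in agents:
        best = max(candidates[agent], key=lambda path: score(path, chosen))
        chosen.append(best)
    return chosen


values = {"a": 3.0, "b": 2.0, "c": 1.0}  # hypothetical per-vertex rewards


def marginal_score(path, earlier_paths):
    # Count a vertex only if no earlier agent plans to be there at the same step.
    total = 0.0
    for step, vertex in enumerate(path):
        if all(other[step] != vertex for other in earlier_paths):
            total += values[vertex]
    return total


candidates = {1: [("a", "b"), ("c", "b")], 2: [("a", "c"), ("b", "a")]}
plans = plan_sequentially([1, 2], candidates, marginal_score)
print(plans)  # [('a', 'b'), ('b', 'a')]
```

Agent 2 avoids the path starting at "a" because agent 1 already claims that vertex at the first step, so its marginal value there is zero.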
The best online patrolling policy for agent i in a multi-agent setting is defined recursively, where $\pi^*_i$ denotes the best policy of agent i. To ensure that the reward function only takes into account the marginal reward value, we need to exclude double counting. There are two types of double counting. The first is synchronous double counting, which occurs when two agents patrol the same vertex within the same time step; in this case the reward for patrolling the vertex is received twice. The second is asynchronous double counting, which occurs when agent i decides to visit vertex $v_n$ at $t_1$ and agent j ($j < i$) already has an action visiting $v_n$ at $t_2$ ($t_1 < t_2$) within the horizon D, i.e., agent j will visit vertex $v_n$ after agent i. The situation where agent j visits vertex $v_n$ before agent i (i.e., $t_1 \geq t_2$) has already been accounted for when calculating $E[R(\pi_D(t))]$ in Eq (9).
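The two kinds of double counting can be detected mechanically by comparing the path of agent i with the paths already fixed by earlier agents. The sketch below is illustrative only; paths are assumed to be equal-length sequences of vertices indexed by look-ahead step, and the function name `double_counts` is hypothetical.

```python
def double_counts(path_i, earlier_paths):
    """Classify revisits between agent i's path and earlier agents' paths.

    Returns two lists:
      synchronous  -> (vertex, step, j): agent j is at the vertex at the same step
      asynchronous -> (vertex, t1, t2, j): agent j visits the vertex later (t1 < t2)
    Here j is the index of the earlier agent in `earlier_paths`.
    """
    synchronous, asynchronous = [], []
    for t1, vertex in enumerate(path_i):
        for j, path_j in enumerate(earlier_paths):
            for t2, other in enumerate(path_j):
                if other != vertex:
                    continue
                if t2 == t1:
                    synchronous.append((vertex, t1, j))
                elif t2 > t1:
                    asynchronous.append((vertex, t1, t2, j))
                # t2 < t1: the earlier visit is already reflected in the prediction
    return synchronous, asynchronous


sync, asyn = double_counts(("a", "b", "c"), [("a", "c", "b")])
print(sync)  # [('a', 0, 0)]
print(asyn)  # [('b', 1, 2, 0)]
```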
Here, we show how to deal with asynchronous double counting, i.e., agent i decides to visit vertex $v_n$ at $t_1$ and agent j ($j < i$) already has an action visiting $v_n$ at $t_2$ ($t_1 < t_2$) within the horizon D. Without loss of generality, we consider the situation where only $v_n$ in $\pi_D^i(t)$ of agent i has been visited by agent j. If more than one agent in $M_{-i} = \{1, \ldots, i-1\}$ has an action to visit $v_n$, we take $t_2$ to be the visit time nearest to $t_1$ (only the nearest one needs to be taken into account, which can be deduced from the transition in Eq (10)). Under this assumption, the expected information reward of agent j for visiting vertex $v_n$ is overestimated, as agent j is unaware that agent i will reset the information at time $t_1$. Thus, we introduce a penalty $p \in \mathbb{R}^+$ for agent i that compensates for the reduction in the reward of agent j and is applied to the expected reward $E[R(\pi_D(t))]$ defined in Eq (9). The penalty $p$ is the loss incurred by agent j, which will visit vertex $v_n$ after agent i. It is defined in terms of $r_{expected} \in \mathbb{R}^+$, the expected reward that agent j computes for visiting vertex $v_n$, and $r_{revised}$, the revised expected reward of agent j visiting vertex $v_n$ as computed by agent i considering only its own action. We define the revised expected belief states at vertex $v_n$ over the times $[t_1+1, \ldots, t_2]$ as $\{w_n(t_1+1), \ldots, w_n(t_2)\}$, which are obtained by the transition in Eq (10) from the predictive belief state $\hat{w}_n(t_1)$ and the action $a(t_1) = v_n$. The revised expected reward is then computed from these revised belief states.
Now, using the algorithm to compute the policy of length D as before, we obtain an action for each individual agent. A team action is formed by combining these individual actions. This team action is not optimal, as the policy of agent i is computed greedily with respect to the policies of agents $M_{-i}$. However, we can still bound its performance relative to the policy obtained by searching the joint action space.
We use the theorem from [31] to obtain a bound on the value of the greedily selected policies:
Theorem 2. Let $g: 2^E \to \mathbb{R}$ be a non-decreasing submodular set function. The greedy algorithm that iteratively selects the element $e \in E$ with the highest incremental value with respect to the previously chosen elements $I \subseteq E$,
$$e = \arg\max_{e' \in E \setminus I} \left[ g(I \cup \{e'\}) - g(I) \right],$$
until the resulting set I has the desired cardinality k, has an approximation bound $g(I_G)/g(I^*)$ of at least $1 - \left(\frac{k-1}{k}\right)^k$, where $I^* \subseteq E$ is the optimal subset of cardinality k that maximises g.
For the number of agents $|M|$ in our formulation, the approximation bound of the greedy algorithm is $1 - \left(\frac{|M|-1}{|M|}\right)^{|M|}$. It has been shown in [3] that this approximation bound decreases monotonically with $|M|$ and, as $|M| \to \infty$, tends to $1 - 1/e \approx 63\%$. Thus, the multi-agent policy yields at least approximately 63% of the reward obtained using the best policy that searches the joint policy space for $|M|$ agents.
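The bound is easy to evaluate numerically. The short sketch below computes it for a few team sizes, including the 10 and 15 agents used in our experiments, and compares it with the limit $1 - 1/e$; the function name `greedy_bound` is illustrative.

```python
import math


def greedy_bound(k):
    """Lower bound 1 - ((k - 1) / k) ** k on the greedy-to-optimal value ratio."""
    return 1.0 - ((k - 1) / k) ** k


for k in (1, 2, 10, 15):
    print(k, round(greedy_bound(k), 4))

print("limit", round(1.0 - 1.0 / math.e, 4))
```

The values decrease from 1 for a single agent towards roughly 0.632, so the guarantee weakens only slightly as more agents are added.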
Having formulated the problem and designed both single-agent and multi-agent algorithms, we will evaluate our methods in the next section.

Empirical Evaluation
To empirically evaluate our approach, we applied it to 10 and 15 agents continuously patrolling a large graph containing 350 vertices and 529 edges. The online computing time limit is 0.5s, because agents must decide which location to visit at each time step within that limit. As the single-agent algorithm can be seen as a special case of the multi-agent algorithm, we present only the results of the multi-agent algorithm here. In the aforementioned graph, we simulated two scenarios:
• Scenario A: we use the same Markov information and threat models for every vertex in the graph;
• Scenario B: we apply 3 different information and threat models to different vertices in the graph.
Notice that in Scenario A the information and threat models at different locations are homogeneous. However, as the information and threat are non-stationary, the information and threat states still vary among these locations. Scenario A aims to capture the situation where the locations in the environment hold the same types of information and threat. In Scenario B, the information and threat models at different locations are heterogeneous, i.e., different locations in the environment may hold different types of information and threat.
We set the parameters in the reward function (i.e., Eq (1)) and the value function (i.e., Eq (6)) as follows: the weight parameter α = 0.33 and the discount factor γ = 0.9. More specifically, in Scenario A, we define two Markov chains whose transition matrices $P_R$ and $P_I$ satisfy the monotonicity assumption of Eq (7). For example, the first two rows of $P_R$ satisfy the constraints in Eq (8). In Fig 2, the size of the circle at each location denotes the absolute value of the instant reward at that vertex, the colour denotes its sign (black is positive and red is negative), the green circles are the agents' current locations, and the "R" and "r" at the lower right of a vertex denote that its threat state is "2" and "1" respectively.
For standard POMDP solvers such as POMCP, the sizes of the joint action space and joint observation space grow exponentially with the number of agents, which makes them intractable for our multi-agent patrolling problem with its large number of actions and observations. Hence, we benchmark against a random algorithm (Random) and a baseline algorithm (Baseline), and measure the total reward, which combines the information value gathered and the damage suffered by the agents. Specifically,
• Random moves the agents to a random location adjacent to the agents' current position.
• Baseline moves the agents to the adjacent location with the highest value in the next step.
We assume the baseline algorithm sequentially computes policies for individual agents to avoid different agents selecting the same vertex, which is similar to PH-1.
• PH-D is our multi-agent patrolling algorithm, where D is the maximum horizon, i.e., the number of steps we look ahead. We vary the maximum horizon D over the set {2,4,8} to investigate the extra computation involved for higher values of the maximum horizon, and report the results of our algorithm for each value.
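The two benchmark policies can be sketched in a few lines. This is an illustration under stated assumptions: `value` is a hypothetical dictionary of expected next-step values per vertex, the graph is a toy example, and the way the baseline avoids already claimed vertices is our reading of the sequential computation described above.

```python
import random


def random_move(adjacency, position, rng=random):
    """Random: move to a uniformly chosen adjacent vertex."""
    return rng.choice(adjacency[position])


def baseline_moves(adjacency, positions, value):
    """Baseline: each agent in turn takes the best unclaimed adjacent vertex."""
    claimed, moves = set(), []
    for pos in positions:
        # Fall back to all neighbours if every one of them is already claimed.
        options = [v for v in adjacency[pos] if v not in claimed] or adjacency[pos]
        target = max(options, key=lambda v: value[v])
        claimed.add(target)
        moves.append(target)
    return moves


# Hypothetical triangle graph and next-step values, for illustration only.
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
value = {0: 0.5, 1: 1.0, 2: 2.0}

print(baseline_moves(adjacency, [0, 1], value))  # [2, 0]
print(random_move(adjacency, 0) in adjacency[0])  # True
```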
The initial locations of the agents are randomly distributed in the graph. Agents patrol continuously for 3000 time steps in the stochastically changing graph. For each scenario and each algorithm we ran 1000 rounds and plotted the results in Figs 3 and 4, where the error bars depict the 95% confidence intervals around the means. Non-overlapping error bars reject the null hypothesis at a significance level of 0.05. In both scenarios, Random performs poorly and its total reward never reaches more than 30% of the reward obtained by the other two algorithms. In Scenario A, both PH-8 and Baseline perform well, and PH-8 outperforms the baseline algorithm by at least 5%. However, for the graph with different Markov models in Scenario B, our algorithm is significantly better than all the other algorithms, and PH-8 outperforms the baseline algorithm by more than 44% for 10 agents and by 21% for 15 agents. In addition, across the maximum horizons D in {2,4,8}, the reward obtained by PH-D increases with D, while its computation time increases exponentially with D. For D > 8, the computing time for each step exceeds our time limit for online decision making. Thus, we can conclude that the use of our predictive heuristic in PH-D has a significant impact on performance and that D can be adjusted to trade off solution quality against computation time while still outperforming the baseline algorithms.
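For readers who wish to reproduce error bars of this kind, the sketch below computes a mean with a 95% confidence half-width and checks whether two intervals overlap. It assumes a normal approximation (1.96 standard errors), which is reasonable for many rounds but is not stated in the paper; the reward samples are made-up numbers for illustration, not experimental results.

```python
from math import sqrt
from statistics import mean, stdev


def mean_ci95(samples):
    """Return (mean, half-width) of a normal-approximation 95% interval."""
    half_width = 1.96 * stdev(samples) / sqrt(len(samples))
    return mean(samples), half_width


def intervals_overlap(a, b):
    """True if two (mean, half-width) intervals overlap."""
    (mean_a, half_a), (mean_b, half_b) = a, b
    return abs(mean_a - mean_b) <= half_a + half_b


# Made-up total rewards from five rounds of two algorithms (illustration only).
rewards_x = [10.0, 12.0, 11.0, 13.0, 9.0]
rewards_y = [20.0, 22.0, 21.0, 23.0, 19.0]

ci_x, ci_y = mean_ci95(rewards_x), mean_ci95(rewards_y)
print(ci_x[0], ci_y[0])               # 11.0 21.0
print(intervals_overlap(ci_x, ci_y))  # False
```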

Conclusion
In this paper, we developed an online multi-agent patrolling algorithm for large, partially observable and stochastic environments where information is distributed alongside threats. Specifically, a predictive heuristic is defined to evaluate policies that look ahead several steps. For the multi-agent algorithm, we extended the sequential policy computation method for individual agents to deal with partially observable problems. We empirically showed that, for 10 agents in a large graph, our algorithm outperforms the baseline algorithm by more than 44%. In future work, on the one hand, as this is the first algorithm for patrolling under uncertainty and threats, we plan to devise better heuristics and algorithms that provide theoretical performance guarantees. On the other hand, as our formulation is a basic model of UAVs patrolling under uncertainty and threats, we will consider settings in which the communication system of the agents may locally break down as a result of damage, or in which some agents may be destroyed by accumulated damage.
Supporting Information
S1 Video. Video to show the simulation of 15 agents patrolling problem. (MP4)
S1 Code. Java code to implement the scenarios and computations in the paper.