Adversarial Online Learning with Variable Plays in the Pursuit-Evasion Game: Theoretical Foundations and Application in Connected and Automated Vehicle Cybersecurity

We extend the adversarial/non-stochastic multi-play multi-armed bandit (MPMAB) to the case where the number of arms to play is variable. The work is motivated by the fact that the resources allocated to scan different critical locations in an interconnected transportation system change dynamically over time and with the environment. By modeling the malicious hacker and the intrusion monitoring system as the attacker and the defender, respectively, we formulate the problem for the two players as a sequential pursuit-evasion game. We derive the condition under which a Nash equilibrium of the strategic game exists. For the defender side, we provide an exponential-weight based algorithm with sublinear pseudo-regret. We further extend our model to heterogeneous rewards for both players, and obtain lower and upper bounds on the average reward of the attacker. We provide numerical experiments to demonstrate the effectiveness of variable-arm play.


I. INTRODUCTION
Currently, the world is experiencing an evolution from the traditional transportation system to the next generation of intelligent transportation systems (ITS). ITS aims to satisfy the ever-increasing need for mobility in major cities, which has caused growing traffic congestion, air pollution, poor user experience, and crashes. Developing a sustainable intelligent transportation system requires better usage of the existing infrastructure and its seamless integration with information and communication technologies (ICT). Enabled by advances in telecommunications, electronics, and computing over recent decades, the subsystems (infrastructures and vehicles) in ITS are expected to interoperate and communicate with each other in order to provide a better and safer traveling experience [1].
The interconnection between the infrastructure and the vehicles relies on various types of sensors to provide state information and situational awareness. However, this has also increased the vulnerability of these advanced systems to cyber attacks. For instance, cyber attacks on vehicle sensors have recently been demonstrated in [2], [3], where the authors used optimization-based approaches to fool the Light Detection and Ranging (LiDAR) sensors on a vehicle. At the system level, the infrastructure and the vehicles can be viewed as individual nodes in a large interconnected network, where a single malicious attack on a subset of sensors of one node can easily propagate through this network, affecting other network components (e.g., other vehicles, traffic control devices, etc.). For example, Feng et al. [4] demonstrated that by sending falsified data to actuated and adaptive signal control systems, a malicious hacker could increase the total system delay in a real-world corridor. Therefore, there is an increasing need for cyber security solutions, especially sensor security solutions, to enhance the safety and reliability of the entire system.
Cyber security is an extremely broad topic. However, previous work on cyber security in the realm of ITS mainly focuses on either attack or defense strategies. For instance, there exists a large body of research illustrating the potential risks of connected and automated vehicle (CAV) technologies that result in anomalous/false information [5]-[8]. In the case of CAV sensor security, several critical sensors are discussed in [9], including differential global positioning systems (GPS), inertial measurement units, engine control sensors, tyre-pressure monitoring systems (TPMS), LiDAR, and cameras. Meanwhile, CAVs require more engine control units (ECUs), and many CAV features require complex interactions among multiple ECUs, which may expose more vulnerabilities compared to non-CAVs. Several studies also assess the potential threats to the transportation infrastructure [4], [10], [11]. For example, field devices such as traffic signals and roadside units are susceptible to tampering. The aforementioned literature illustrates the potential threats of sensor attacks to connected transportation systems.
Besides threat detection, prevention is normally recognized as one of the best defense strategies against malicious hackers or attackers. In order to deploy better prevention mechanisms, behaviors of both the attacker and the defender have to be considered so that the attack profile can be predicted. There is a gap in the literature in considering both the attacker and the defender and the adaptive interactions between them when devising defense strategies, which this paper aims to bridge.
Moreover, as more sensors are mounted aboard CAVs or installed on the transportation infrastructure, it becomes more difficult to monitor the sensors continuously, mainly due to limited resources. Although there is a large body of literature addressing sensor security in ITS [12]-[16], most of it focuses on sensor intrusion/anomaly detection without attack profile analysis, i.e., analysis of which sensors are more vulnerable and should be protected. In this study, we address this gap by modeling attacker and defender behaviors in a game-theoretic framework. Specifically, instead of considering intrusion/anomaly detection for all sensors in the system, we model attack and defense behaviors in order to predict which subset of sensors is more likely to be compromised. To be more practical, we consider a dynamic resource constraint for the defender. We model this problem as a sequential pursuit-evasion game between two players. Consider the intrusion monitoring system of a sensor network as the defender. At each time, the defender selects a subset of sensors to scan, where the number of selected sensors changes based on the environment and the scanning history, among other factors. Meanwhile, a hacker, considered as the attacker, attempts to select a sensor to compromise without being scanned by the defender. We assume that both the attacker and the defender are able to learn their opponent's behavior adaptively, with only partial information, over time, and we investigate the resulting decision problem.
The main contributions of this work are as follows: First, in order to predict the attack profile, we model the behaviors of the attacker and the defender as the adversarial (or non-stochastic) multi-armed bandit (MAB) problem and the multi-armed bandit problem with variable plays (MAB-VP), where the two players are playing a constant-sum game against each other. To the best of our knowledge, this is the first study of MAB-VP in the non-stochastic setting. Second, we derive conditions under which a Nash equilibrium of the strategic game exists. For the defender, we provide an exponential-weighted algorithm, which is shown to have sublinear pseudo-regret. Finally, we consider a more realistic setting where the rewards are heterogeneous among different sensors, and derive lower and upper bounds on the attacker's average reward.

II. LITERATURE REVIEW
In this paper, we explore online learning algorithms in the class of adversarial or non-stochastic multi-armed bandit (MAB) problems. The adversarial MAB problem was first addressed by Auer et al. [17], who also proposed the well-known exponential-weight algorithm for exploration and exploitation (Exp3). Exp3 runs the Hedge algorithm, originally proposed by Freund and Schapire [18], as a subroutine. Since then, there have been several extensions to this class, including the online shortest path problem [19], routing games [20], bandit online linear optimization [21], and combinatorial bandits [22].
The multi-play multi-armed bandit (MPMAB) problem is another research direction for MAB. In this extension, a fixed number of resources (i.e., arms) is allocated at each time step. The MPMAB has attracted a lot of interest, and several studies have been conducted along this direction [23]-[26]. However, most of these studies focus only on the stochastic setting. Much less attention has been paid to the adversarial MPMAB problem: Cesa-Bianchi and Lugosi [22] considered combinatorial bandits in the adversarial setting, where they proposed the ComBand algorithm, which achieves sublinear regret. Only a limited number of studies have considered variable plays. Fouché et al. [28] proposed a scaling algorithm combined with a MAB algorithm, which they call the S-MAB algorithm. In this algorithm, the number of arms played at each time changes in order to satisfy an efficiency constraint. However, although the authors considered a dynamic environment, the S-MAB algorithm assumes a stochastic setting with an unknown reward distribution for each arm. Another work addressing variable plays is by Lesage-Landry and Taylor [29], who extended the stochastic MAB to the stochastic plays setting, i.e., the number of arms to play evolves as a stationary process. Both of these studies considered only a stochastic setting and did not conduct any game-strategy analysis.
Although there is a wealth of research on using game theory in the transportation literature, very few studies have applied game theory to ITS cybersecurity. Sedjelmaci et al. [30] conducted a survey on recent studies utilizing game theory to protect ITS from attacks, which is, to the best of our knowledge, the only survey paper on this topic. However, without considering the adaptive behavior of opponents, the current literature mostly models the cybersecurity problem as a non-repeated game, such as Stackelberg security games (SSG) [31], [32], zero-sum games [33], [34], or Bayesian games [35], [36]. The solutions from these types of models are typically in the form of equilibria, with an implied assumption that the players have knowledge of their opponent's actions/beliefs. Instead, we formulate this cybersecurity problem as a sequential pursuit-evasion game, which also lies in the realm of algorithmic learning theory. There have been several studies of the pursuit-evasion problem [37]-[39]. However, they either lack robustness against adaptive changes in adversarial behavior, or do not consider multiple plays, variable plays, dynamic resource allocation, or heterogeneous rewards.
Since the behavior of the adversarial opponent usually cannot be described in a stochastic way, in this paper we study the MAB-VP problem in a non-stochastic setting, where we propose the Exp3.M with variable plays (Exp3.M-VP) algorithm. Next, we consider a game setting for two players, and show that a Nash equilibrium of the strategic game exists. Finally, we consider heterogeneous rewards for both players and derive lower and upper bounds for the attacker's average reward. Numerical analyses are conducted in order to further demonstrate our results.

III. SYSTEM MODEL AND PROBLEM FORMULATION

A. SYSTEM MODEL
Consider the repeated pursuit-evasion game between an attacker and a defender in discrete time. At each time step t, the attacker selects one of the N locations, indexed by the set N = {1, 2, ..., N}, to hide in (e.g., compromise a sensor), while the defender searches M_t locations simultaneously, where 1 ≤ a ≤ M_t ≤ b < N. The behaviors of the attacker and the defender are described by their respective sets of marginal probabilities α(t) = (α_k(t), k ∈ N) and β(t) = (β_k(t), k ∈ N), where α_k(t) and β_k(t) are the respective probabilities that the k-th location is chosen by the attacker and the defender at time t. Note that α(t) and β(t) represent the adversarial behavior with respect to one's opponent at time t: they can describe randomized strategies of the players, or a probabilistic belief held by one side about the likelihood of an action by the other side.
Define two sets of binary variables x_k(t) and y_k(t) such that x_k(t) = 1 if the defender does not search location k at time t, and x_k(t) = 0 otherwise. Similarly, y_k(t) = 1 if the attacker compromises location k at time t, and y_k(t) = 0 otherwise. When the attacker (defender) does not know the type of algorithm/strategy the opponent uses, it may regard x_k(t) (y_k(t)) as a predetermined but unknown number. When the attacker (defender) does have this information, it may regard x_k(t) (y_k(t)) as a random variable with P(x_k(t) = 0) = β_k(t) (resp. P(y_k(t) = 1) = α_k(t)). The game is played in a sequence of trials t = 1, 2, ..., T. In this work we consider the case where neither the attacker nor the defender knows the strategy adopted by the other player. As will be discussed later, each player has to choose locations based on its own reward history.

B. PROBLEM FORMULATION: PARTIAL INFORMATION GAME
In this study we consider the scenario where both players have limited information on the adaptive behavior of their opponent. Define π = (π_t, t = 1, 2, ...) as the control policy of the attacker, and let Π denote the policy space. Denote the location selection (action) sequence under policy π as I = (I_t, t = 1, 2, ...), with |I_t| = 1. At each time, under policy π_t, the attacker chooses one location I_t ∈ N to attack according to its private randomization device, where (ω(t), t = 1, 2, ...) denotes the randomized strategy of the attacker. Let x_k(t) be the state of location k for the attacker at time t. The attacker then scores the corresponding reward r_I(t) = x_{I_t}(t), and observes only the reward r_I(t) for the chosen action I_t.
The attacker receives an expected reward E[r_I(t)] = 1 − β_{I_t}(t) at time t, which is the mean number of successful attacks at the chosen location. Note that in this section we consider a homogeneous reward across all locations; heterogeneous location-dependent rewards are considered in section VII. In this study, we assume a 100% success rate for both attacks and detection attempts. Then, within the time window {t = 1, 2, ..., T}, the attacker considers the following maximization problem:

max_{π ∈ Π} E[ Σ_{t=1}^T r_I(t) ],

where the expectation is with respect to the randomness of the system state and the mixed strategy of the attacker. We assume that the defender can scan M_t locations at time t. Define γ = (γ_t, t = 1, 2, ...) as the control policy of the defender, similarly defined, and let (θ(t), t = 1, 2, ...) denote the randomized strategy of the defender. The notation is summarized as follows:

α_k(t)/β_k(t): marginal probability that the attacker compromises/the defender scans location k at time t
x_k(t)/y_k(t): indicator variable of whether the defender/attacker selects location k at time t
I_t/J_t: index of the locations that the attacker compromises/the defender scans at time t
M_t: number of locations scanned by the defender at time t
a/b: lower/upper bound of M_t
r(t)/s(t): single-step reward of the attacker/defender
ω(t)/θ(t): private randomization device of the attacker/defender
π_t/γ_t: control policy of the attacker/defender

Let y_k(t) be the state of location k for the defender at time t. The defender also observes only the rewards of the selected action J_t. Denote the total reward at time t of the defender given the location selection sequence J as s_J(t) = Σ_{j∈J_t} y_j(t). The defender therefore receives the expected reward E[s_J(t)] = Σ_{j∈J_t} α_j(t) at time t. This expected reward represents the mean number of detected attacks among the M_t scanned locations.
We assume that the number of arms M_t the defender plays at each time is determined by a scaling function f : R^{N+1} → {a, a + 1, ..., b} of the d-moving average of the rewards of each arm, where a and b are integers and 1 ≤ a ≤ b < N. We also assume that M_t is a function of the environment constraint L_t, since in reality checking a location (e.g., scanning a specific sensor/unit in a CAV) consumes resources. Then, given the time horizon T, the defender tries to solve the following constrained optimization problem:

max_γ E[ Σ_{t=1}^T s_J(t) ]  subject to  M_t = f(ŷ^d(t), L_t) ∈ {a, a + 1, ..., b},

where ŷ_i^d is the d-moving average of the rewards of arm i. Using a moving average of the reward allows us to capture the reward history while also capturing the dynamic change of the reward at each location, so that the scaling function can adjust the number of arms to play at each time. The expectation is with respect to the randomness of the system state and the mixed strategy of the defender. Note that there is no requirement on the scaling function f other than that its output be bounded by the integers a and b. Furthermore, L_t can be an arbitrary integer between a and b, thereby capturing any set of environmental conditions. When the defender knows the type of strategy the attacker uses, it may regard y_j(t) as stochastic, i.e., assume the attacker chooses location j with probability P(y_j(t) = 1) = α_j(t). Note that this is different from the stochastic MAB setting, where a fixed (time-invariant) reward distribution for each arm is assumed. However, here we assume that neither the defender nor the attacker has information about the opponent's strategy. Hence, the difficulty is that the defender can only estimate α_j(t) by imposing a belief on the adversarial behavior based on previous observations and rewards. Furthermore, we make no assumptions about the distribution of α_j(t).
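Since the paper leaves the exact form of the scaling function f open, the following sketch shows one hypothetical choice, in which the defender scans more locations when the recent moving-average rewards are high, capped by the environment budget. The function name, the interpolation rule, and the example values are our own illustration, not the paper's specification.

```python
def scaling_function(ybar_d, L_t, a, b):
    """One hypothetical scaling function f: map the d-moving-average rewards
    ybar_d of the N arms and the environment budget L_t to a play count M_t
    with a <= M_t <= b, as the bounds in the paper require.
    Higher recent rewards -> scan more locations (up to the budget)."""
    mean_reward = sum(ybar_d) / len(ybar_d)   # rewards are in [0, 1]
    budget = min(b, max(a, L_t))              # clamp the budget into [a, b]
    m = a + round((budget - a) * mean_reward) # interpolate between a and budget
    return max(a, min(budget, m))

# If recent rewards are high, the defender scans closer to the upper bound.
m_low = scaling_function([0.1] * 10, L_t=5, a=1, b=5)    # few detections lately
m_high = scaling_function([0.9] * 10, L_t=5, a=1, b=5)   # many detections lately
```

Any other bounded rule would satisfy the requirement on f stated above; the point is only that M_t stays in {a, ..., b} and may react to both the reward history and L_t.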

IV. ALGORITHMS FOR THE ATTACKER AND THE DEFENDER
We assume the attacker adopts Exp3, proposed by Auer et al. [17]. (However, as we show later in section VI, the equilibrium of the two-player game does not depend on any properties of the algorithm other than a no-regret guarantee.) The Exp3 algorithm uses an efficient and randomized policy to select only one arm at each time t. The adversarial single-play bandit problem is closely related to the problem of learning to play an unknown repeated matrix game. In this setting, a player without prior knowledge of the game matrix plays the game repeatedly against an adversary with complete knowledge of the game and unbounded computational power. The basic idea of Exp3 is that at each time the player uses a randomized policy, so that the adversarial player cannot know the exact choice of the player before she/he plays. For the details of Exp3, refer to Appendix A.
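As a concrete reference point, the following is a minimal sketch of one Exp3 round in the style of Auer et al. [17]: mix the weight distribution with uniform exploration, play one arm, and apply an importance-weighted update to the played arm only. The variable names, the exploration parameter value, and the toy environment are ours.

```python
import math
import random

def exp3_step(weights, gamma, reward_fn):
    """One round of Exp3: sample an arm from the mixed strategy,
    observe only that arm's reward, and update its weight with an
    importance-weighted reward estimate."""
    n = len(weights)
    total = sum(weights)
    # Mixture of the weight distribution and uniform exploration.
    probs = [(1 - gamma) * w / total + gamma / n for w in weights]
    arm = random.choices(range(n), weights=probs)[0]
    reward = reward_fn(arm)            # only the chosen arm is observed
    estimate = reward / probs[arm]     # unbiased importance-weighted estimate
    weights[arm] *= math.exp(gamma * estimate / n)
    return arm, reward

# Toy run: arm 2 always pays 1, the others pay 0.
random.seed(0)
w = [1.0] * 5
for _ in range(2000):
    exp3_step(w, gamma=0.1, reward_fn=lambda a: 1.0 if a == 2 else 0.0)
best = max(range(5), key=lambda i: w[i])
```

After enough rounds the weight of the consistently rewarding arm dominates, which is the behavior the defender must account for when the attacker learns in this way.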
Unlike the attacker, who selects a single location to attack, we assume the defender can search multiple locations, the number of which may vary over time. Both sides seek to maximize their respective total rewards. At the beginning of a time step, each side decides which location(s) to target, and cannot change this selection until the next time step. We develop a variable-play extension of the Exp3.M algorithm for the defender, which we call Exp3.M-VP, as detailed in Algorithm 1. In the Exp3.M-VP algorithm, let S denote the set of selected locations and S^c its complement. Under the non-stochastic assumption, at each time step the Exp3.M-VP algorithm consists of the following two procedures: 1) Receive M_t, which is determined by the scaling function f and may depend on the environment constraint L_t as well as the history rewards ŷ^d(t) at time t, among other factors. Note that f can take any form, and defining its exact form is outside the scope of this paper; here, we assume M_t is given. 2) Apply an adversarial MPMAB algorithm that selects M_t arms (locations) to play.
For the second procedure, we use the Exp3.M algorithm as a subroutine of Exp3.M-VP. Exp3.M was proposed by Uchiya et al. [27] and extends Exp3 to the adversarial MPMAB setting. In contrast to Exp3, which selects one arm at each time, Exp3.M randomly selects a fixed number M of arms at each time. Note that both Exp3 and Exp3.M achieve sublinear (weak) regret, i.e., they are no-regret algorithms. In order to ensure that the probability α̂_i(t) of selecting location i by DepRound at step 12 does not exceed 1, the Exp3.M-VP algorithm checks whether all w_j(t) are less than (1 − η) at step 5. If that is the case, α̂_i(t) calculated at step 11 will be less than 1 for all i = 1, 2, ..., N without any weight modification, and the set S_0(t) is set to ∅ at step 8. Otherwise, all actions i with w_i(t) ≥ κ_t are classified into S_0(t) and their weights are set to κ_t at step 6.
This way, we have α̂_i(t) = 1 for all i ∈ S_0(t). The subroutine DepRound [40] at step 12 draws M_t out of N items with the specified marginal distribution (α̂_1, α̂_2, ..., α̂_N), and is included in Appendix B.
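A compact sketch of the DepRound idea [40]: repeatedly pick two still-fractional marginals and shift probability mass between them, choosing the direction at random so that each marginal is preserved in expectation, until every entry is 0 or 1. The tolerance handling and pair-selection order are our own simplification.

```python
import random

def depround(p, tol=1e-9):
    """Dependent rounding: round marginals p (summing to an integer M)
    to a 0/1 vector with exactly M ones, preserving each marginal
    in expectation. Returns the indices rounded to 1."""
    p = list(p)
    while True:
        # Indices that are still strictly fractional.
        frac = [i for i, v in enumerate(p) if tol < v < 1 - tol]
        if len(frac) < 2:
            break
        i, j = frac[0], frac[1]
        alpha = min(1 - p[i], p[j])
        beta = min(p[i], 1 - p[j])
        # Shift mass between i and j so that one of them hits 0 or 1;
        # the randomized direction keeps E[p_i] and E[p_j] unchanged.
        if random.random() < beta / (alpha + beta):
            p[i] += alpha
            p[j] -= alpha
        else:
            p[i] -= beta
            p[j] += beta
    return [i for i, v in enumerate(p) if v > 1 - tol]

random.seed(1)
selected = depround([0.5, 0.5, 0.8, 0.7, 0.5])  # marginals sum to 3
```

Each loop iteration fixes at least one entry to 0 or 1, so the procedure terminates after at most N − 1 iterations; since the marginals sum to M_t, exactly M_t entries end at 1.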

V. ADAPTIVE LEARNING OF THE DEFENDER
In this section, we address the adaptive learning of the defender. Based on Algorithm 1 for the defender, problem (4) can be recast without the constraint set, since we divide the problem into a scaling procedure and the MAB-VP problem. Formally, let y(t) := (y_k(t), ∀k ∈ N) for t = 1, ..., T over a finite horizon T. For any search sequence of the defender J = (J_t, t = 1, 2, ...) and a fixed sequence of attacks by the attacker (y(1), y(2), ...), the total reward of the defender at T, denoted by G_J(T), is given by

G_J(T) = Σ_{t=1}^T Σ_{j∈J_t} y_j(t).

We obtain the maximum reward by consistently searching the subset A_{M_t}, the most attacker-active location set at each time step t with cardinality M_t:

G_max(T) = Σ_{t=1}^T Σ_{j∈A_{M_t}} y_j(t).

The regret is then defined as

R(T) = G_max(T) − G_J(T).

When a = b, i.e., M_t is time-invariant, the above regret reduces to the standard regret of the MPMAB problem.
Since we care more about the competition against the optimal action in expectation, we define the pseudo-regret for our MAB-VP problem, following the definition of pseudo-regret in [41], as

R̄(T) = E[G_max(T)] − E[G_J(T)],

where the expectation is with respect to the randomness of the system state and the mixed strategy of the defender.
Theorem 1. For any η ∈ (0, 1], the pseudo-regret of Exp3.M-VP satisfies

R̄(T) ≤ (e − 1) η G_max(T) + (N/η) ln(N/a),

which holds for any assignment of rewards and for any T > 0.
Proof. See Appendix C.

By appropriately choosing the parameter η, we can obtain the following corollary:

Corollary 1.1. With η = min{1, √(N ln(N/a) / ((e − 1) b T))}, the pseudo-regret of Exp3.M-VP satisfies

R̄(T) ≤ 2 √(e − 1) √(b T N ln(N/a)),

which holds for any T > 0 and for any assignment of rewards.
The proof of Corollary 1.1 follows that of Corollary 3.2 in [42]; see Appendix D. Note that when a = b, the upper bound in Corollary 1.1 coincides with the upper bound for Exp3.M in [27], and when a = b = 1 it reduces to the upper bound obtained for Exp3 in [17].
Corollary 1.2. Define s̄_∞ := lim inf_{T→∞} E[(1/T) Σ_{t=1}^T s_J(t)] as the average reward of the defender over an infinite time horizon. Using the same parameter η as in Corollary 1.1, when the defender uses the Exp3.M-VP algorithm against an attacker who adopts a no-regret algorithm, we have s̄_∞ = ν/N if M_t is a wide-sense stationary process with mean ν.
In order to prove Corollary 1.2, we need the following lemma, originally derived in [39]:

Lemma 2 ([39]). The value of the single-play pursuit-evasion matrix game on N locations is v_1 = 1/N.

The proof of Corollary 1.2 then proceeds as follows.
Proof. The above problem is equivalent to the problem of two players playing an unknown repeated bimatrix game, where the game value v_{i,t} (i = 1, 2 for the row and the column player, respectively) changes over time. Define the game matrices as two N × N matrices B and C, where B_{ij} + C_{ij} = 1 for any (i, j) ∈ N × N. At each time t, the defender (i.e., the row player) chooses the M_t rows indexed by J_t, and at the same time the attacker (i.e., the column player) chooses exactly one column I_t = k. The defender then receives the payoff Σ_{j∈J_t} B_{jk} = Σ_{j∈J_t} y_j(t). The defender uses a mixed strategy p_t ∈ [0, 1]^N at each time t, and the attacker chooses according to a probability vector q_t ∈ [0, 1]^N; the entries of p_t sum to M_t and the entries of q_t sum to 1. Let v_{1,t} be the game value of the game matrix B at time t. Then by Corollary 1.1, we have

lim inf_{T→∞} (1/T) Σ_{t=1}^T p_t^⊤ B q_t ≥ lim inf_{T→∞} (1/T) Σ_{t=1}^T v_{1,t},   (10)

and

E[s_J(t)] = p_t^⊤ B q_t,   (11)

where q_t is a distribution vector whose I_t-th component is 1.
Combining (10) and (11), we have

lim inf_{T→∞} (1/T) Σ_{t=1}^T E[s_J(t)] ≥ lim inf_{T→∞} (1/T) Σ_{t=1}^T v_{1,t}.   (12)

Note that at each time t, v_{1,t} = M_t v_1, where v_1 is the game value when the defender chooses only one location. Hence, by taking the limit of (12) and applying the law of large numbers, we have

s̄_∞ = lim_{T→∞} (1/T) Σ_{t=1}^T M_t v_1 = ν v_1,

where the first equality comes from the fact that the attacker is also adopting a no-regret algorithm (e.g., Exp3). Finally, according to Lemma 2, we obtain the result.
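The step v_{1,t} = M_t v_1 can be sanity-checked for the homogeneous game: if the defender scans a uniformly random M-subset and the attacker hides uniformly, the defender's expected per-round reward is exactly M/N. A small exact computation (the function name is ours):

```python
from itertools import combinations
from fractions import Fraction

def defender_value_uniform(N, M):
    """Expected defender reward per round when the defender scans a uniformly
    random M-subset and the attacker hides uniformly among N locations.
    Exact rational arithmetic via Fraction."""
    subsets = list(combinations(range(N), M))
    total = Fraction(0)
    for S in subsets:
        for k in range(N):
            # Attacker at k w.p. 1/N; subset S w.p. 1/len(subsets);
            # the defender scores 1 iff k is in the scanned subset.
            total += Fraction(int(k in S), N * len(subsets))
    return total

v = defender_value_uniform(5, 2)   # -> Fraction(2, 5), i.e. M/N
```

Averaging M/N over a stationary M_t with mean ν gives ν/N, which is the limit used in the corollary.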
Corollary 2.1. Under the setting that the defender adopts Exp3.M-VP and the attacker adopts a no-regret algorithm, and assuming that M_t is a wide-sense stationary process with mean ν, each player adopts the best response for the infinite-horizon problem.
The proof can be obtained by extending the proof of the defender side in Corollary 1.2 to both sides, and is omitted for brevity. Note that in Corollary 1.2 and Corollary 2.1 we do not specify which type of learning algorithm the attacker is using, and the only assumption is that the attacker adopts a no-regret algorithm.

VI. ADAPTIVE LEARNING OF THE ATTACKER
We assume that the attacker adopts the Exp3 algorithm to randomly attack one location at each time step. The Exp3 algorithm runs the Hedge algorithm as a subroutine. Unlike Hedge, which directly takes advantage of the full information of the reward vector x(t) := (x_i(t), ∀i ∈ N), Exp3 observes partial information and feeds the estimated reward vector x̂(t) := (x̂_i(t), ∀i ∈ N) to Hedge. Hedge then updates β̂_i(t), the prediction of the probability β_i(t), for i ∈ N. For more details on the Exp3 and Hedge algorithms, see Appendix A.
The defender adopts the Exp3.M-VP algorithm, which has sublinear regret, as shown in Theorem 1. As a result, if the attacker favors one location, intuitively the defender will eventually identify this most attractive location, failing to scan it only a number of times that grows sublinearly in T. When M_t is a time-invariant constant, it follows immediately that the best strategy for the attacker over an infinite time horizon is to treat each location equally, either stochastically or deterministically. However, when M_t is variable, the same argument cannot be made trivially.

Theorem 3. Define r̄_∞ := lim inf_{T→∞} E[(1/T) Σ_{t=1}^T r_I(t)], and let the location sequence g be the sequence of the greedy policy π_greedy, where g(t) = arg min_{i∈N} β̂_i(t) for all t. If M_t is bounded by two positive integers a, b such that M_t ∈ {a, a + 1, ..., b}, then under any policy π we have

r̄_∞ ≤ (N − a)/N,

and under the greedy policy π_greedy,

r̄_∞ ≥ (N − b)/N.

Proof. See Appendix E.
Note that by Corollary 1.2, we can directly obtain the following result:

Corollary 3.1. Under the setting that the defender adopts Exp3.M-VP, the attacker adopts Exp3, and M_t is a wide-sense stationary process with mean ν, we have r̄_∞ = (N − ν)/N.

Moreover, when M_t is a wide-sense stationary process, following the proof of Theorem 3, it is not hard to show that even the greedy policy obtains r̄_∞ = (N − ν)/N. Note that the above argument does not require Exp3.M-VP to have any property other than the no-regret guarantee; therefore, the greedy policy can serve the attacker as a countermeasure against the entire family of no-regret algorithms. On the defender side, according to Corollary 1.2 and Corollary 3.1, a straightforward way to increase the average reward over an infinite time horizon is to increase the value of ν, i.e., to assign more resources to the intrusion monitoring system.
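The greedy policy g(t) = arg min_i β̂_i(t) is simple to state in code. In the sketch below the belief β̂ is a hypothetical empirical estimate built from observed scan counts; the paper does not prescribe how the attacker forms this belief, so the estimator and names are ours.

```python
def greedy_attack(scan_counts, t):
    """Greedy attacker policy of Theorem 3: attack the location believed
    least likely to be scanned. Here beta_hat is a hypothetical empirical
    estimate: the fraction of past rounds each location was scanned."""
    beta_hat = [c / t for c in scan_counts]
    # arg min over locations; ties are broken by lowest index
    return min(range(len(beta_hat)), key=lambda i: beta_hat[i])

# Location 3 has been scanned least often, so the greedy attacker targets it.
target = greedy_attack([40, 55, 38, 12, 45], t=100)
```

Against any no-regret defender, consistently targeting the least-scanned location is exactly the countermeasure discussed above.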

VII. ADAPTIVE ADVERSARIAL LEARNING WITH HETEROGENEOUS REWARDS
In this section we consider heterogeneous rewards that are location-dependent. This corresponds to a more general setting, since in reality some locations (e.g., sensors) are more critical to the system than others. Let μ_k be the location-dependent reward corresponding to the k-th location. That is, the rewards of the attacker and the defender are r_I(t) = μ_{I_t} x_{I_t}(t) and s_J(t) = Σ_{j∈J_t} μ_j y_j(t), respectively. Without loss of generality, we assume that μ_1 ≥ μ_2 ≥ ... ≥ μ_N. We denote the frequency with which location k is selected, given the selection sequence I over a time horizon T, as d_k^I(T) = c_k^I(T)/T, where c_k^I(T) = |{t ≤ T : I_t = k}| is the total number of times location k is selected by the attacker over horizon T given the selection sequence I.
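The selection frequencies d_k^I(T) = c_k^I(T)/T can be computed directly from a selection sequence; a minimal sketch (the function name is ours):

```python
from collections import Counter

def selection_frequencies(I, N):
    """Empirical frequencies d_k^I(T) = c_k^I(T) / T of each location k
    in the attacker's selection sequence I over horizon T = len(I)."""
    T = len(I)
    c = Counter(I)              # c[k] is c_k^I(T); missing keys count as 0
    return [c[k] / T for k in range(N)]

d = selection_frequencies([0, 2, 2, 1, 2, 0, 2, 2], N=4)  # d_2 = 5/8
```

These frequencies are exactly the decision variables d over which the optimization problems of this section are posed.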
Since the problem is no longer a constant-sum game under the setting of heterogeneous rewards, Corollary 2.1 and Corollary 3.1 cannot be directly applied. However, we can still show that when the reward for each location is heterogeneous, the average rewardr ∞ in an infinite time horizon is bounded within an interval determined by a, b, and µ k , k = 1, 2, ..., N .
Theorem 4. Given heterogeneous rewards, the average reward of the attacker r̄_∞ over an infinite time horizon is bounded within the interval

[ (N − b) / Σ_{k=1}^N (1/μ_k),  (N − a) / Σ_{k=1}^N (1/μ_k) ].

In order to prove Theorem 4, we need Lemmas 5 and 6, as follows. Let supp(d) = {k ∈ N : d_k > 0} for any feasible solution d, and let K* be the cardinality of supp(d). Then we have the following lemmas:

Lemma 5. For any optimal solution d* of problem (18), (i) μ_k d_k* = μ_j d_j* for any k, j ∈ supp(d*), and (ii) supp(d*) consists of the indices of the locations with the K* highest μ.

Lemma 6. Problem (23) is lower bounded by (N − b) / Σ_{j=1}^N (1/μ_j).
The proofs of Lemmas 5 and 6 can be found in Appendices F and G, respectively. We now give the proof of Theorem 4.
Proof. When the defender uses Exp3.M-VP, its no-regret guarantee implies that, for any realization I, the average reward of the defender satisfies

lim inf_{T→∞} (1/T) E[G_J(T)] ≥ lim inf_{T→∞} max_{S⊆N, |S|=a} Σ_{k∈S} μ_k d_k^I(T).

Since the per-round rewards of the two players sum to μ_{I_t}, averaging over t we have

(1/T) Σ_{t=1}^T E[r_I(t)] ≤ Σ_{k∈N} μ_k d_k^I(T) − max_{S⊆N, |S|=a} Σ_{k∈S} μ_k d_k^I(T) + o(1).

Therefore, by letting T approach infinity, we have

r̄_∞ ≤ lim sup_{T→∞} [ Σ_{k∈N} μ_k d_k^I(T) − max_{S⊆N, |S|=a} Σ_{k∈S} μ_k d_k^I(T) ]

for any policy π.
Consider the following optimization problem:

maximize_{d ∈ Δ_N}  Σ_{k∈N} μ_k d_k − max_{S⊆N, |S|=a} Σ_{k∈S} μ_k d_k,   (18)

where Δ_N is the set of distributions over N and d = (d_k, k ∈ N). Let the optimal solution and its objective value be d* and r_max, respectively. Then we have r̄_∞ ≤ r_max. Without loss of generality, we assume that supp(d*) = {1, 2, ..., K*}. Therefore, according to Lemma 5, we have

d_k* = (1/μ_k) / Σ_{j=1}^{K*} (1/μ_j),  k = 1, 2, ..., K*.

Then the optimal value of problem (18) is given by (K* − a)/Σ_{j=1}^{K*} (1/μ_j), which is increasing with respect to the value of K* = 1, 2, ..., N. This gives the upper bound of r̄_∞.
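The candidate optimal values (K − a)/Σ_{j=1}^{K} (1/μ_j) appearing in the proof can be evaluated numerically; the sketch below simply scans all support sizes K and returns the best candidate (the function name and example values are ours):

```python
def upper_bound_attacker(mu, a):
    """Evaluate (K - a) / sum_{j<=K} 1/mu_j for K = 1..N, the candidate
    optimal values from Lemma 5, and return the best one.
    mu must be sorted in non-increasing order, as assumed in Section VII."""
    best = 0.0
    for K in range(1, len(mu) + 1):
        denom = sum(1.0 / m for m in mu[:K])   # sum of reciprocal rewards
        best = max(best, (K - a) / denom)
    return best

mu = [1.0, 0.9, 0.8, 0.7]          # heterogeneous per-location rewards
ub = upper_bound_attacker(mu, a=1)
```

In the homogeneous case (all μ_k = 1) this reduces to (N − a)/N, matching the per-location bound discussed in section VI.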
When the defender adopts Exp3.M-VP and the attacker adopts Exp3, the no-regret guarantee of Exp3 gives

E[G_Exp3^I(T)] ≥ G_max(T) − o(T),

where G_Exp3^I(T) is the total reward of the attacker when adopting Exp3, and G_max(T) = max_{k∈N} Σ_{t=1}^T μ_k x_k(t) is the maximum total reward the attacker can gain when selecting a fixed location to attack. Thus, the average reward r̄_∞ of the attacker over an infinite time horizon is lower bounded by the optimal value of the following optimization problem:

maximize_{d ∈ Δ_N}  Σ_{k∈N} μ_k d_k − max_{S⊆N, |S|=b} Σ_{k∈S} μ_k d_k.   (23)

Denote the optimal value of problem (23) as r_min. Then, according to Lemma 6, r_min ≥ (N − b)/Σ_{j=1}^N (1/μ_j), which gives the lower bound of r̄_∞.

Theorem 4 is practical when the attack success rate is not 100 percent for all locations, with μ_k representing the success rate of attacks on location k. Note that although Theorem 4 assumes heterogeneous rewards, it applies to homogeneous rewards as well. Figure 1 shows the range of the attacker's average reward over an infinite time horizon under different attack success rates, where we assume the same attack success rate for all locations for simpler visualization. Note that we do not even assume that M_t is a wide-sense stationary process; the only assumption is that it is confined within a range with lower and upper bounds a and b, respectively. The shaded blue region in Figure 1 indicates the potential reward the attacker can obtain in infinite time, and the red and blue lines indicate the lower and upper bounds on the attacker's average reward in infinite time, according to Theorem 4. When the attack success rate is 1, the lower and upper bounds become equivalent to the bounds in Theorem 3. It is straightforward to see that the lower the success rate of the attack, the safer the system will be.

VIII. NUMERICAL ANALYSIS
We conducted extensive simulations to illustrate the performance of the proposed algorithm and policy. Our numerical analysis consists of three parts. In section VIII-A, we conduct simulations to test the performance of Exp3.M-VP in a single-player setting. In section VIII-B, we compare the performance of Exp3.M-VP with several bandit learning algorithms, i.e., Exp3, Exp3.M, upper confidence bound (UCB) [43], and ε-greedy [44], on real in-vehicle network datasets from the Car-Hacking datasets [45]. In section VIII-C, we run simulations on the proposed game model and algorithmic solutions.

A. SIMULATIONS ON A SINGLE PLAYER
In this section we consider the single-player setting, where the Exp3.M-VP algorithm is evaluated on a ten-armed bandit problem with arm rewards drawn independently from Bernoulli distributions with means {0.75, ..., 3/(4k), ..., 0.075} for k = 1, 2, ..., 10. This scenario was simulated over a fixed time horizon of T = 20,000 time steps. The number of arms played at each time step is drawn independently from a discrete uniform distribution over {1, 2, 3}. The parameter η is set to 0.1. Figure 2a shows the regret of Exp3.M-VP against the expected upper bound on the regret from Theorem 1. We can see that the actual regret of Exp3.M-VP grows at a smaller rate than its expected upper bound, and the discrepancy becomes larger as time increases. Figure 2b shows the evolution of the normalized weight of each location over the entire time horizon. As shown in this figure, Exp3.M-VP identifies the three locations with the highest average reward (i.e., the blue, orange, and green curves) after only a short period of time, and the remaining weights vanish to nearly 0. The reason why only three locations stand out is that M_t, the number of arms played at each time, lies within the set {1, 2, 3}. The fluctuations of the weights are partly due to the fact that the Exp3.M-VP algorithm needs to explore different locations in order to update its choice predictions and estimates, and partly due to the fact that the weights must always sum to M_t, which changes over time.
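The single-player experiment above can be approximated with a short sketch. This is an illustration of the variable-play idea rather than a faithful implementation of Algorithm 1: the exact capping step and the dependent-rounding (DepRound) subroutine of Exp3.M-VP are replaced here by a crude clip-and-renormalize step and plain sampling without replacement.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, eta = 10, 20_000, 0.1
means = np.array([3.0 / (4 * k) for k in range(1, N + 1)])  # 0.75, 0.375, ..., 0.075

log_w = np.zeros(N)              # log-weights for numerical stability
counts = np.zeros(N, dtype=int)  # how often each arm is played

for t in range(T):
    M = int(rng.integers(1, 4))  # variable number of plays, uniform over {1, 2, 3}
    w = np.exp(log_w - log_w.max())
    p = M * w / w.sum()          # target marginal probabilities summing to M
    p = np.minimum(p, 1.0)       # crude cap (the paper uses an exact capping step)
    p *= M / p.sum()             # renormalize so the marginals sum to M again
    # stand-in for DepRound: sample M distinct arms with these (approximate) marginals
    S = rng.choice(N, size=M, replace=False, p=p / p.sum())
    x = (rng.random(N) < means).astype(float)  # Bernoulli rewards for all arms
    x_hat = np.zeros(N)
    x_hat[S] = x[S] / p[S]       # importance-weighted reward estimates
    log_w += eta * x_hat / N     # exponential-weight update
    counts[S] += 1
```

The highest-mean arm should accumulate the most plays over the horizon, mirroring the concentration of the normalized weights on the best locations in Figure 2b.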

B. EVALUATIONS ON CAR-HACKING DATASET FOR THE DEFENDER
In this section we compare Exp3.M-VP with Exp3, Exp3.M, UCB, and ε-greedy by implementing these algorithms on two in-vehicle network datasets from the Car-Hacking datasets. The Car-Hacking datasets were generated by logging the Controller Area Network (CAN) traffic via the OBD-II port of a real vehicle while message injection attacks were performed. Each dataset contains 300 intrusions of message injections over 26 unique CAN IDs. Each intrusion lasts 3 to 5 seconds, and each dataset covers a total of 30 to 40 minutes of CAN traffic. Specifically, we test the performance on the spoofing attack datasets, in which the attacks target the RPM gauge and the driving gear. That is, among the 26 arms representing CAN IDs, two of them (the RPM gauge and the driving gear) contained spoofing attacks. Figure 3 shows the cumulative average reward of each bandit learning algorithm used by the defender. The experiments were conducted over T = 7,000 time steps, and the number of arms played by Exp3.M-VP was sampled from a truncated Gaussian distribution on the interval [1, 3], with mean 2 and standard deviation 0.8. The number of arms played by Exp3.M was set to 3. We can see that both Exp3.M and Exp3.M-VP obtain higher cumulative average rewards than the single-play algorithms, owing to the benefit of multiple or variable plays. We also compare Exp3.M directly with Exp3.M-VP when the fixed number of plays M equals the mean ν of the variable play count; Figure 4 shows the results. This figure demonstrates the average reward of the two algorithms under four values of M and ν. We can see that the performance of the two algorithms is very close, mainly due to the fact that M = ν. Note that in some instances Exp3.M-VP has access to fewer resources/arms, and in some instances more. As a result, throughout the iterations Exp3.M sometimes outperforms Exp3.M-VP and sometimes underperforms it. Eventually, however, both algorithms reach the same reward and successfully identify the compromised locations.
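The variable play count used above, a Gaussian with mean 2 and standard deviation 0.8 truncated to [1, 3], can be generated by simple rejection sampling. The rounding-to-integer step below is our assumption, since the paper does not state how the continuous draw is discretized:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_plays(lo=1, hi=3, mu=2.0, sigma=0.8):
    """Draw an integer play count from a Gaussian truncated to [lo, hi]."""
    while True:  # rejection sampling: redraw until the value lands in range
        v = rng.normal(mu, sigma)
        if lo <= v <= hi:
            return int(round(v))

draws = [sample_plays() for _ in range(1000)]
```

With these parameters the draws concentrate on 2, so on average the defender scans ν = 2 locations per step, matching the Exp3.M baseline with M = 2.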

C. SIMULATIONS ON TWO PLAYERS
We now consider a game setting where two players, an attacker and a defender, play the pursuit-evasion game against each other. This corresponds to the realistic scenario where a malicious hacker tries to compromise either a sensor/ECU in an in-vehicle sensor network, or an entire vehicle/infrastructure node in an interconnected transportation system, without being identified by the intrusion monitoring system. At the same time, the intrusion monitoring system tries to identify as many compromised locations as possible to minimize the potential loss. We consider a ten-armed bandit problem for the two players, where the attacker adopts Exp3 and the defender adopts Exp3.M-VP. The scenario was simulated over T = 100,000 time steps, and the number of arms played by the defender was sampled from a truncated Gaussian distribution on the interval [1, 3], with mean 2 and standard deviation 0.8. The parameter η for both Exp3 and Exp3.M-VP was set according to Corollary 1.1. Figure 5 illustrates the average reward and the equilibrium reward for the two players. Since N = 10 and ν = 2, according to Corollary 1.2 and Corollary 3.1, the equilibrium rewards for the attacker and the defender are 0.8 and 0.2, respectively. We can see that the average rewards of both players converge to the equilibrium rewards after a relatively short period, after which they stay around the equilibrium rewards with small fluctuations. The fluctuations are due to the fact that Exp3 and Exp3.M-VP use randomized policies and occasionally need to explore different locations in order to update their choice predictions and estimates.

IX. CONCLUSIONS
In this paper, we extend the adversarial/non-stochastic MPMAB to the case where the number of plays can change over time, and propose the Exp3.M-VP algorithm to handle the variable-play property. This extension is motivated by the uncertainty in the resources allocated to the intrusion monitoring system for scanning at each time in resource-constrained systems, such as an interconnected transportation system. We derive a sublinear regret bound for Exp3.M-VP, which reduces to the existing bounds in the literature when the number of arms played at each time is constant. We introduce a game setting where an attacker and a defender play a pursuit-evasion game against each other. The defender, representing the intrusion monitoring system, adopts Exp3.M-VP, and the attacker, representing the malicious hacker, adopts Exp3. We derive the condition under which a Nash equilibrium of the strategic game exists. Finally, we consider heterogeneous rewards for arms and obtain lower and upper bounds on the attacker's average reward over an infinite time horizon. We provide several numerical experiments that demonstrate our results.
This work provides insights into deploying an intrusion monitoring system in either an in-vehicle network or a transportation network: to minimize the potential loss of the system from cyber threats, one can either increase the average resources allocated to intrusion monitoring, or change the potential reward vector of the locations so as to reduce the reward bound in Theorem 4. One potential extension of this work is to consider the connectivity or correlations between different arms, which can capture the spread of cyber attacks, and to use such information to facilitate the decision making of the intrusion monitoring system.
[Algorithm listing (fragment): choose action I_t according to the distribution; 5: receive the reward vector x(t) and score gain x_{I_t}(t).]
On the other hand, define A*_b as the best location index subset with b elements. Then inequality (26a) holds because A*_b ⊆ N, inequality (26b) follows from the inequality of arithmetic and geometric means (AM–GM), and inequality (26c) is obtained by recursively applying step 15 of Algorithm 1, which results in equality (27). Note that we also have (28), where inequality (28a) is due to the fact that ŷ_j(t) = y_j(t) for all j ∈ S_0(t), and the last inequality (28b) holds because η ∈ (0, 1]. Combining (25c), (26c), (28a), and (28b) yields (29). Taking expectations of both sides of inequality (29), we obtain (30), where inequality (30b) uses the fact that E[ŷ_i(t) | S(1), ..., S(t−1)] = y_i(t). Since A*_b = ∪_{M_t} A*_{M_t} trivially holds, we have (32). Therefore, by combining (30) and (32), we obtain the inequality stated in Theorem 1.
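For completeness, the AM–GM step behind inequality (26b) takes, for the b nonnegative terms indexed by A*_b, the form

```latex
\frac{1}{b}\sum_{i \in A_b^*} z_i \;\ge\; \Bigl(\prod_{i \in A_b^*} z_i\Bigr)^{1/b},
\qquad z_i \ge 0,
```

with equality if and only if all z_i are equal; here the z_i are placeholders for the terms that the omitted display (26) applies the inequality to.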

APPENDIX E PROOF OF THEOREM 3
Proof. Note that for any policy π of the attacker, inequality (34) holds, where the last inequality (34d) comes from the fact that G_max(T) ≥ Ta/N for any defender policy γ.
Under the greedy policy we have β_{g(t)}(t) ≤ b/N, which implies r(t) ≥ (N − b)/N for any t. Therefore, by using the greedy policy π_greedy, we have r̄_∞ ≥ (N − b)/N.

APPENDIX F PROOF OF LEMMA 5
The proof of Lemma 5 is an extension of the proof of Lemma 4 in [39]. The main difference is that the matrix H is now an N × |C(N , a)| matrix compared to the one in the original proof which is N × N .
Proof. 1) Problem (18) can be written with an N × |C(N, a)| matrix H, where each column j represents one set S ∈ C(N, a) such that H_ij = 0 for all i ∈ S, and H_ij = µ_i for all i ∈ N \ S. The remainder is the same as the original proof. Now consider a zero-sum game with payoff matrices H and −H for the row and column players, whose mixed strategy vectors are d and u, respectively. Any optimal solution d* to problem (18) is a Nash equilibrium strategy for the row player, and by the indifference condition we obtain, for any j ∈ supp(d*), Σ_{k≠j, k∈supp(d*)} µ_k d*_k = Const., (37) which implies µ_k d*_k = µ_j d*_j for any k, j ∈ supp(d*).
2) The second part of the Lemma is proved by contradiction. Assume that there exist i ∈ supp(d*) and j ∈ N \ supp(d*) such that µ_j > µ_i. Let o be the constant such that o = µ_k d*_k for all k ∈ supp(d*). Then consider a feasible solution d with d_k = 0 for all k ∈ ((N \ supp(d*)) \ {j}) ∪ {i}, and d_k = d*_k + ε for all k ∈ (supp(d*) \ {i}) ∪ {j}, where ε = d*_i (1 − µ_i/µ_j)/K*, which yields a higher objective value, a contradiction.
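As a quick sanity check of the indifference condition µ_k d*_k = µ_j d*_j, suppose the support is all of N; the equilibrium weights are then proportional to 1/µ_k. The success rates below are hypothetical values chosen for illustration:

```python
import numpy as np

mu = np.array([0.5, 0.8, 1.0])     # hypothetical per-location success rates
d = (1.0 / mu) / (1.0 / mu).sum()  # weights proportional to 1/mu, normalized

products = mu * d  # mu_k * d_k is the same constant for every k
```

Locations that are harder to attack successfully (smaller µ_k) receive proportionally more probability mass, exactly as the condition requires.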

APPENDIX G PROOF OF LEMMA 6
Proof. Consider the linear program (38). It is easy to see that problem (23) is lower bounded by problem (38). The dual of program (38) can be written as (39). Note that program (39) is equivalent to problem (40), which is essentially problem (18) except that the set C(N, a) is replaced by C(N, b). Therefore problem (40) has the optimal value, in terms of K* and b, stated in Lemma 6.

NEDA MASOUD is an Assistant Professor in the Department of Civil and Environmental Engineering at the University of Michigan, Ann Arbor. She received her Ph.D. in Civil and Environmental Engineering from the University of California, Irvine. She holds an M.S. degree in Physics and a B.S. degree in Industrial Engineering. Her research interests include large-scale optimization and machine learning with applications in shared-use mobility systems and connected and automated vehicle systems.