Exponential asymptotic optimality of Whittle index policy

Published in: Queueing Systems

Abstract

We evaluate the performance of the Whittle index policy for restless Markovian bandits. It is shown in Weber and Weiss (J Appl Probab 27(3):637–648, 1990) that if the bandit is indexable and the associated deterministic system has a global attractor fixed point, then the Whittle index policy is asymptotically optimal in the regime where the number of activated arms grows proportionally with the arm population. In this paper, we show that, under the same conditions, the convergence to optimality is exponentially fast in the arm population, unless the fixed point is singular (to be defined later), which almost never happens in practice. Our result holds for the continuous-time model of Weber and Weiss (1990) and for a discrete-time model in which all bandits make synchronous transitions. Our proof is based on the nature of the deterministic equation governing the stochastic system: we show that it is a piecewise affine continuous dynamical system inside the simplex of the empirical measure of the arms. Using simulations and numerical solvers, we also investigate the singular cases, as well as how the level of singularity influences the (exponential) convergence rate. We illustrate our theorem on a Markovian fading channel model.


Notes

  1. The most efficient algorithm to test indexability and compute the index can be found in Gast et al. [13]. For a given model with d states, the complexity of this algorithm is \(o(d^3)\).

  2. If two states or more had the same index, to specify an index policy, one would need a tie-breaking rule. Our proof would work if the tie-breaking rule defines a strict order of the states.

  3. The code and parameters to reproduce all experiments and figures of the paper are available in a Git repository https://gitlab.inria.fr/phdchenyan/code_ap2021.

  4. Recall that \(\phi \) is an application from \(\Delta ^d\) to \(\Delta ^d\). This means in particular that all the rows of all matrices \(\textbf{K}_i\) sum to 1. Therefore, each of these matrices has an eigenvalue 1. When we write "the norm of all eigenvalues of \(\textbf{K}_i\) is smaller than 1", we mean that 1 is an eigenvalue of \(\textbf{K}_i\) with multiplicity one, and all other eigenvalues must be of norm strictly less than 1.

  5. In what follows, we write \(-0.4 \dots \) to mean a number that approaches \(-0.4\).

  6. We refer to our Git repository for a more thorough numerical exploration of this case.

  7. The rates \(Q^{a^n(t)}_{ij}\) are well defined because bandits evolve independently and the probability that two arms evolve at the same time is 0.

References

  1. Aalto, S., Lassila, P., Osti, P.: Whittle index approach to size-aware scheduling with time-varying channels. In: Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pp. 57–69 (2015)

  2. Ansell, P., Glazebrook, K.D., Nino-Mora, J., et al.: Whittle’s index policy for a multi-class queueing system with convex holding costs. Math. Methods Oper. Res. 57(1), 21–39 (2003)

  3. Avrachenkov, K.E., Borkar, V.S.: Whittle index policy for crawling ephemeral content. IEEE Trans. Control Netw. Syst. 5(1), 446–455 (2016)

  4. Blondel, V.D., Bournez, O., Koiran, P., et al.: The stability of saturated linear dynamical systems is undecidable. J. Comput. Syst. Sci. 62(3), 442–462 (2001)

  5. Brown, D.B., Smith, J.E.: Index policies and performance bounds for dynamic selection problems. Manag. Sci. 66, 3029–3050 (2020)

  6. Darling, R., Norris, J.: Differential equation approximations for Markov chains. Probab. Surv. 5, 37–79 (2008)

  7. Duff, M.O.: Q-learning for bandit problems. In: Proceedings of the Twelfth International Conference on International Conference on Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ICML’95, pp. 209–217 (1995)

  8. Duran, S., Verloop, M.: Asymptotic optimal control of Markov-modulated restless bandits. In: International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 2018), vol 2. ACM: Association for Computing Machinery, Irvine, US, pp. 7:1–7:25 (2018)

  9. Gast, N.: Expected Values Estimated via Mean-Field Approximation are 1/N-Accurate. In: ACM SIGMETRICS/ International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’17, Urbana-Champaign, United States, p. 26 (2017)

  10. Gast, N., Van Houdt, B.: A refined mean field approximation. In: Proceedings of the ACM on Measurement and Analysis of Computing Systems 1(28) (2017)

  11. Gast, N., Bortolussi, L., Tribastone, M.: Size expansions of mean field approximation: transient and steady-state analysis. In: 2018–36th International Symposium on Computer Performance, Modeling, Measurements and Evaluation, Toulouse, France, pp. 1–2 (2018)

  12. Gast, N., Latella, D., Massink, M.: A refined mean field approximation of synchronous discrete-time population models. Perform. Eval. 126, 1–21 (2018)

  13. Gast, N., Gaujal, B., Khun, K.: Computing Whittle (and Gittins) index in subcubic time. arXiv preprint arXiv:2203.05207 (2022)

  14. Gast, N., Gaujal, B., Yan, C.: LP-based policies for restless bandits: necessary and sufficient conditions for (exponentially fast) asymptotic optimality (2022)

  15. Gittins, J., Glazebrook, K., Weber, R.: Multi-armed Bandit Allocation Indices. John Wiley & Sons, Hoboken (2011)

  16. Gittins, J.C.: Bandit processes and dynamic allocation indices. J. R. Stat. Soc. Ser. B 148–177 (1979)

  17. Hodge, D.J., Glazebrook, K.D.: On the asymptotic optimality of greedy index heuristics for multi-action restless bandits. Adv. Appl. Probab. 47(3), 652–667 (2015)

  18. Hu, W., Frazier, P.: An asymptotically optimal index policy for finite-horizon restless bandits (2017)

  19. Kifer, Y.: Random Perturbations of Dynamical Systems. Progress in Probability. Birkhäuser, Boston (1988)

  20. Kurtz, T.G.: Strong approximation theorems for density dependent Markov chains. Stoch. Process. Appl. 6(3), 223–240 (1978)

  21. Larranaga, M., Ayesta, U., Verloop, I.M.: Dynamic control of birth-and-death restless bandits: application to resource-allocation problems. IEEE/ACM Trans. Netw. 24(6), 3812–3825 (2016)

  22. Lattimore, T., Szepesvári, C.: Bandit Algorithms. Cambridge University Press, Cambridge (2020)

  23. Liu, K., Zhao, Q.: Indexability of restless bandit problems and optimality of Whittle index for dynamic multichannel access. IEEE Trans. Inf. Theory 56(11), 5547–5567 (2010)

  24. Meshram, R., Manjunath, D., Gopalan, A.: On the Whittle index for restless multiarmed hidden Markov bandits. IEEE Trans. Autom. Control 63(9), 3046–3053 (2018)

  25. Niño-Mora, J., Villar, S.S.: Sensor scheduling for hunting elusive hiding targets via Whittle's restless bandit index policy. In: International Conference on NETwork Games, Control and Optimization (NetGCooP 2011). IEEE, pp. 1–8 (2011)

  26. Ouyang, W., Eryilmaz, A., Shroff, N.B.: Asymptotically optimal downlink scheduling over Markovian fading channels. In: 2012 Proceedings IEEE INFOCOM, IEEE, pp. 1224–1232 (2012)

  27. Papadimitriou, C.H., Tsitsiklis, J.N.: The complexity of optimal queuing network control. Math. Oper. Res. 293–305 (1999)

  28. Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st edn. John Wiley & Sons Inc, New York (1994)

  29. Raghunathan, V., Borkar, V., Cao, M., et al.: Index policies for real-time multicast scheduling for wireless broadcast systems. In: IEEE INFOCOM 2008-The 27th Conference on Computer Communications, IEEE, pp. 1570–1578 (2008)

  30. Verloop, M.: Asymptotically optimal priority policies for indexable and nonindexable restless bandits. Ann. Appl. Probab. 26(4), 1947–1995 (2016)

  31. Weber, R.R., Weiss, G.: On an index policy for restless bandits. J. Appl. Probab. 27(3), 637–648 (1990)

  32. Weber, R.R., Weiss, G.: Addendum to: On an index policy for restless bandits. Adv. Appl. Probab. 23(2), 429–430 (1991)

  33. Whittle, P.: Restless bandits: activity allocation in a changing world. J. Appl. Probab. 25A, 287–298 (1988)

  34. Ying, L.: Stein’s method for mean field approximations in light and heavy traffic regimes. POMACS 1(1), 1–27 (2017)

  35. Zayas-Caban, G., Jasin, S., Wang, G.: An asymptotically optimal heuristic for general nonstationary finite-horizon restless multi-armed, multi-action bandits. Adv. Appl. Probab. 51(3), 745–772 (2019)

  36. Zhang, X., Frazier, P.I.: Restless bandits with many arms: beating the central limit theorem (2021)

  37. Zhang, X., Frazier, P.I.: Near-optimality for infinite-horizon restless bandits with many arms. arXiv preprint arXiv:2203.15853 (2022)

Acknowledgements

This work was supported by the ANR project REFINO (ANR-19-CE23–0015). Chen YAN would like to express his gratitude to Maaike Verloop for her warm hospitality at Toulouse INP during November 2019, and for the numerous engaging discussions on Whittle indices throughout his stay there.

Author information

Corresponding author: Chen Yan.


Appendices

A Proof of Theorem 1

Proof

Let \(\textbf{m}^*\) be the fixed point of \(\phi \). As \(\textbf{P}^0\), \(\textbf{P}^1\) are rational, each coordinate of \(\textbf{m}^*\) is a rational number. Let \(\{ N_k \}_{k \ge 0}\) be an increasing sequence of integers tending to \(\infty \), such that for all \(k \ge 0\) and all \(1 \le i \le d\), \(m^*_i N_k\) and \(\alpha N_k\) are integers. We then fix an N from this sequence \(\{ N_k \}_{k \ge 0}\). Recall that \(m_i N\) is the number of arms in state i in configuration \(\textbf{m}\) and that \(S_n(t)\) is the state of arm n at time t. We use \(\textbf{S}(t)\) to denote the state vector of the N arms system at time t. Let \(\textbf{S}^*\) be a state vector corresponding to configuration \(\textbf{m}^*\) with N arms. This is possible as \(m^*_i N\) is an integer for all \(i\in \{1, \dots , d\}\).
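Such a sequence \(\{N_k\}_{k \ge 0}\) always exists: since every \(m^*_i\) and \(\alpha \) are rational, any multiple of the least common multiple of their denominators works. A minimal sketch of this construction (our illustration, with made-up rational values):

```python
from fractions import Fraction
from math import lcm

def integer_population_sizes(m_star, alpha, k_max=3):
    """First k_max values of N such that every m*_i N and alpha N are
    integers: the multiples of the lcm of all denominators."""
    base = lcm(*(x.denominator for x in (*m_star, alpha)))
    return [base * k for k in range(1, k_max + 1)]

# hypothetical fixed point m* = (1/3, 1/6, 1/2) and alpha = 2/5
sizes = integer_population_sizes(
    [Fraction(1, 3), Fraction(1, 6), Fraction(1, 2)], Fraction(2, 5))
# lcm(3, 6, 2, 5) = 30, so sizes = [30, 60, 90]
```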

Note that in configuration \(\textbf{m}^*\) (i.e., state vector \(\textbf{S}^*\)), an optimal action \(\textbf{a}^*\) under the relaxed constraint (5) will activate exactly \(\alpha N\) arms. As \(\textbf{a}^*\) is sub-optimal compared to an optimal policy for the original N arms problem (2)-(3), we have

$$\begin{aligned} V^{(N)}_{\textrm{opt}} (\alpha )+ h(\textbf{S}^*)&= \max _{\textbf{a}\in \{ 0,1 \}^N} \bigg \{ \sum _{n=1}^{N} R^{a_n}_{S^*_n} + \mathbb {E}_{\textbf{a}} \Big [ V\big (\textbf{S}(1)\big ) \mid \textbf{S}(0) = \textbf{S}^* \Big ] \bigg \} \\&\ge \sum _{n=1}^{N} R^{a^*_n}_{S^*_n} + \mathbb {E}_{\textbf{a}^*} \Big [ V\big (\textbf{S}(1)\big ) \mid \textbf{S}(0) = \textbf{S}^* \Big ]\\&= N V^{(1)}_{\textrm{rel}}(\alpha ) + \mathbb {E}_{\textbf{a}^*}\big [h(\textbf{S}(1))\mid \textbf{S}(0)=\textbf{S}^* \big ], \end{aligned}$$

where in the above equation the function \(V:\textbf{S}\rightarrow \mathbb {R}\) is the bias of the MDP. The first line corresponds to Bellman’s equation (see e.g., Equation 8.4.2 in Chapter 8 of Puterman [28]), the second line is because \(\textbf{a}^*\) is a valid action for the N-arms MDP but might not be the optimal action, and the last line is because \(\sum _{n=1}^{N} R^{a^*_n}_{S^*_n}=V^{(N)}_{\textrm{rel}}(\alpha )=N V^{(1)}_{\textrm{rel}}(\alpha )\).

We hence obtain

$$\begin{aligned} V^{(1)}_{\textrm{rel}}(\alpha ) \ge \frac{V^{(N)}_{\textrm{opt}} (\alpha )}{N} \ge V^{(1)}_{\textrm{rel}}(\alpha ) + \frac{\mathbb {E}_{\textbf{a}^*}V\big (\textbf{S}(1)\big ) - h(\textbf{S}^*)}{N}. \end{aligned}$$

In the following, we bound \(\mathbb {E}_{\textbf{a}^*}[V\big (\textbf{S}(1)\big ) - h(\textbf{S}^*)]\). This will be achieved in two steps.

Step One

We define for two state vectors \(\textbf{y}\), \(\textbf{z}\) the distance

$$\begin{aligned} \delta (\textbf{y},\textbf{z}) := \sum _{n=1}^{N} \mathbbm {1}_{ \big \{ y_n \ne z_n \big \} }, \end{aligned}$$

which counts the number of arms (among the N) whose states differ between the two vectors. This distance has the property that for all \(\textbf{y}\) and \(\textbf{z}\) such that \(\delta (\textbf{y},\textbf{z}) = k\), we can find a sequence of state vectors \(\textbf{z}_1, \textbf{z}_2,\ldots , \textbf{z}_{k-1}\) satisfying \(\delta (\textbf{y},\textbf{z}_1) = \delta (\textbf{z}_1, \textbf{z}_2)=\cdots = \delta (\textbf{z}_{k-2},\textbf{z}_{k-1}) = \delta (\textbf{z}_{k-1}, \textbf{z}) = 1\). In what follows, we show that there exists \(C>0\) independent of N such that for all state vectors \(\textbf{y}\) and \(\textbf{z}\), the bias function h(.) satisfies:

$$\begin{aligned} | h(\textbf{y}) - h(\textbf{z}) | \le C \cdot \delta (\textbf{y},\textbf{z}). \end{aligned}$$

In view of the above property of \(\delta \), we only need to prove this for \(\delta (\textbf{y},\textbf{z})=1\), i.e.,

$$\begin{aligned} | h(\textbf{y}) - h(\textbf{z}) | \le C. \end{aligned}$$

Let \(\textbf{y}\), \(\textbf{z}\) be two state vectors such that \(\delta (\textbf{y},\textbf{z})=1\), and assume without loss of generality that arm 1 is the one whose state differs: \(y_1\ne z_1\) and \(y_n=z_n\) for \(n\in \{2\dots N\}\). We use a coupling argument as follows: We consider two trajectories of the N arms system, \(\textbf{Y}\) and \(\textbf{Z}\), that start, respectively, in state vectors \(\textbf{Y}(0)=\textbf{y}\) and \(\textbf{Z}(0)=\textbf{z}\). Let \(\pi ^*\) be the optimal policy of the N arms MDP, and suppose that we apply \(\pi ^*\) to the trajectory \(\textbf{Z}\). At time t, the action vector will be \( \pi ^*(\textbf{Z}(t))\). We couple the trajectories \(\textbf{Y}\) and \(\textbf{Z}\) by applying the same action vectors \(\pi ^*(\textbf{Z}(t))\) for \(\textbf{Y}\) and keeping \(Y_n(t)=Z_n(t)\) for arms \(n\in \{2\dots N\}\). The \(\textbf{Z}\) trajectory follows an optimal trajectory, hence Bellman's equation is satisfied: for any \(T>0\), we have:

$$\begin{aligned} T \cdot V^{(N)}_{\textrm{opt}} (\alpha )+ h(\textbf{z}) = \sum _{n=1}^{N} R^{\pi ^*_n (\textbf{z})}_{z_n} + \mathbb {E}_{\pi ^*} \bigg [ \sum _{n=1}^{N} \sum _{t=1}^{T-1} R^{\pi ^*_n (\textbf{Z}(t))}_{Z_n(t)} + V\big ( \textbf{Z}(T) \big ) \ | \ \textbf{Z}(0) = \textbf{z}\bigg ]. \end{aligned}$$
(17)

Since \(\textbf{Y}\) follows a possibly sub-optimal trajectory, we have:

$$\begin{aligned} T \cdot V^{(N)}_{\textrm{opt}} (\alpha )+ h(\textbf{y}) \ge \sum _{n=1}^{N} R^{\pi ^*_n (\textbf{y})}_{y_n} + \mathbb {E}_{\pi ^*} \bigg [ \sum _{n=1}^{N} \sum _{t=1}^{T-1} R^{\pi ^*_n (\textbf{Z}(t))}_{Y_n(t)} + V\big ( \textbf{Y}(T) \big ) \ | \ \textbf{Y}(0) = \textbf{y}\bigg ]. \end{aligned}$$
(18)

Recall that the matrices \(\textbf{P}^0,\textbf{P}^1\) are such that a bandit is recurrent and aperiodic. This shows that the mixing time of a single arm is bounded (independently of N): for any policy \(\pi \in \Pi \)

$$\begin{aligned} \max _{i,j} {\mathop {{{\,\textrm{argmin}\,}}}\limits _{t}} \bigg \{ \mathbb {P}_{\pi } \Big [ Y_1(t) = Z_1(t) \ \Big | \ Y_1(0) = i, Z_1 (0) = j \Big ] > 0 \bigg \} < \infty . \end{aligned}$$

Because of the coupling, for \(0 \le t \le T\) and \(\ 1 \le n \le N\), \(Y_n(t) \ne Z_n(t)\) is only possible for \(n=1\). Furthermore, as the mixing time of an arm is bounded, for T large enough, there is a positive probability, say at least \(p > 0\), that \(Y_1 (T) = Z_1 (T)\). Hence, with probability at most \(1-p\), we have \(\delta \big (\textbf{Y}(T), \textbf{Z}(T)\big ) = 1\), conditional on \(\textbf{Y}(0)= \textbf{y}\) and \(\textbf{Z}(0)= \textbf{z}\).

Let \(r:= 2 \max _{1 \le i \le d, a\in \{0,1\}} | R^a_i |\). Subtracting (17) from (18) gives

$$\begin{aligned} |h(\textbf{y}) - h(\textbf{z})|&\le T \cdot r + \Big |\mathbb {E}_{\pi ^*} \Big [ V\big (\textbf{Y}(T)\big )-V\big (\textbf{Z}(T)\big ) \ \Big | \ \textbf{Y}(0) = \textbf{y}, \textbf{Z}(0)=\textbf{z}\Big ] \Big | \\&\le T \cdot r + (1-p)\max _{\textbf{U},\textbf{V}: \ \delta (\textbf{U},\textbf{V})=1} \big \{ |h(\textbf{U})-h(\textbf{V})| \big \}. \end{aligned}$$

This being true for all \(\textbf{y}, \textbf{z}\) with \(\delta (\textbf{y},\textbf{z})=1\), it implies that \(\max _{\textbf{U},\textbf{V}: \ \delta (\textbf{U},\textbf{V})=1} \big \{ |h(\textbf{U})-h(\textbf{V})| \big \} \le T \cdot r / p\), and we can take the constant \(C:=T \cdot r/p\).

Step Two

Recall that the state vector \(\textbf{S}^*\) corresponds to the optimal (relaxed) configuration \(\textbf{m}^*\). We now prove that

$$\begin{aligned} \mathbb {E}_{\textbf{a}^*}[\delta (\textbf{S}^*,\textbf{S}(1)) \mid \textbf{S}(0)=\textbf{S}^*] \le D \sqrt{N}, \end{aligned}$$

with a constant D independent of N, where \( \textbf{S}(1) \) is the random vector conditional on \(\textbf{S}(0) = \textbf{S}^*\) under action vector \(\textbf{a}^*\).

Indeed, let \(\textbf{x}^*:= \textbf{m}^* N\), and let \(\textbf{X}:= \textbf{m}(1) N\) denote the random d-vector, with \(\textbf{m}(1)\) the random configuration corresponding to \(\textbf{S}(1)\). For each \(1 \le i \le d\), we may write

$$\begin{aligned} X_i = (B_{i,1}^0+B_{i,1}^1)+ (B_{i,2}^0+B_{i,2}^1) +\cdots + (B_{i,d}^0 + B_{i,d}^1) \end{aligned}$$

where \(B_{i,j}^a \sim \textrm{Binomial} (x^*_{j,a}, P^{a}_{ji})\) for \(1 \le j \le d\), \(a \in \{ 0,1 \}\); and \(x^*_{j,0} + x^*_{j,1} = x^*_j\), with \(x^*_{j,a}\) representing the number of arms in state j taking action a when the optimal action vector \(\textbf{a}^*\) is applied to state vector \(\textbf{S}^*\).

By stationarity, we have

$$\begin{aligned} \mathbb {E}_{\textbf{a}^*} (X_i) = \sum _{j=1}^d \sum _{a=0,1} x^*_{j,a} \cdot P^{a}_{ji} = x^*_i, \end{aligned}$$

and

$$\begin{aligned} \text{ Var } (X_i) = \sum _{j=1}^d \sum _{a=0,1} x^*_{j,a} \cdot P^{a}_{ji} (1-P^{a}_{ji}) = \mathcal {O}(N). \end{aligned}$$

Consequently, since \(\mathbb {E}_{\textbf{a}^*} \big | x^*_i - X_i \big | \le \sqrt{\text{ Var } (X_i)}\) by the Cauchy–Schwarz inequality, we can bound

$$\begin{aligned} \mathbb {E}_{\textbf{a}^*}[\delta (\textbf{S}^*,\textbf{S}(1))] \le \sum _{i=1}^d \mathbb {E}_{\textbf{a}^*} \big | x^*_i - X_i \big | \le D \sqrt{N}, \end{aligned}$$

with a constant D independent of N.
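The binomial decomposition above is easy to simulate. The following sketch (our illustration, with made-up counts and matrices, not the paper's repository code) draws one synchronous transition for each of many trials; each group of arms moves by a multinomial draw, whose marginals are exactly the binomials \(B^a_{i,j}\) of the proof:

```python
import numpy as np

rng = np.random.default_rng(1)

def one_step(split, P0, P1, trials=4000):
    """Sample X = m(1) N over many independent one-step transitions.
    split[j] = (passive, active) arm counts in state j; each group moves
    via a multinomial draw with row P^a_j, so X_i is the sum of the
    binomials B^a_{i,j} described in the text. Illustrative sketch."""
    P = (np.asarray(P0, float), np.asarray(P1, float))
    d = len(split)
    samples = np.zeros((trials, d))
    for t in range(trials):
        for j in range(d):
            for a in (0, 1):
                samples[t] += rng.multinomial(split[j][a], P[a][j])
    return samples

# made-up stationary counts for N = 100: x* = (56, 44), 40 active in state 1
S = one_step([(16, 40), (44, 0)],
             [[0.9, 0.1], [0.4, 0.6]],
             [[0.6, 0.4], [0.1, 0.9]])
# column means stay close to (56, 44); fluctuations are O(sqrt(N))
```

The total arm count is conserved exactly in every trial, while each coordinate fluctuates around \(x^*_i\) with standard deviation of order \(\sqrt{N}\), matching the variance computation above.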

To summarize, we have

$$\begin{aligned} \mathbb {E}_{\textbf{a}^*} \big [ |h(\textbf{S}(1)) - h(\textbf{S}^*)| \big ] \le \mathbb {E}_{\textbf{a}^*} \big [ C \cdot \delta \big (\textbf{S}(1), \textbf{S}^*\big ) \big ] \le CD \cdot \sqrt{N}, \end{aligned}$$

hence

$$\begin{aligned} V^{(1)}_{\textrm{rel}}(\alpha ) \ge \frac{V^{(N)}_{\textrm{opt}} (\alpha )}{N} \ge V^{(1)}_{\textrm{rel}}(\alpha ) + \frac{\mathbb {E}_{\textbf{a}^*}\big [h(\textbf{S}(1))\big ] - h(\textbf{S}^*)}{N} \ge V^{(1)}_{\textrm{rel}}(\alpha ) - \frac{CD}{\sqrt{N}}, \end{aligned}$$
(19)

which implies that \(V^{(N)}_{\textrm{opt}} (\alpha )/ N \rightarrow V^{(1)}_{\textrm{rel}}(\alpha )\) when N goes to \(+\infty \). Moreover, from (19), the convergence rate is at least as fast as \(\mathcal {O}(1/ \sqrt{N})\). \(\square \)

B Proof of Lemma 2

In this appendix, we prove Lemma 2. We first show the piecewise affine property in Lemma 6, which gives (i) and (ii). We then show the uniqueness of the fixed point from a bijectivity property in Lemma 7, from which we conclude (iii).

Lemma 6

(Piecewise affine) \(\phi \) is a piecewise affine continuous function, with d affine pieces.

Proof

Let \(\textbf{m}\in \Delta ^d\) be a configuration and recall \(s(\textbf{m})\in \{1, \dots , d\}\) is the state such that \(\sum _{i=1}^{s(\textbf{m})-1}m_i\le \alpha < \sum _{i=1}^{s(\textbf{m})}m_i\). When the system is in configuration \(\textbf{m}\) at time t, WIP will activate all arms that are in states 1 to \(s(\textbf{m})-1\) and not activate any arm in states \(s(\textbf{m})+1\) to d. Among the \(Nm_{s(\textbf{m})}\) arms in state \(s(\textbf{m})\), \(N(\alpha -\sum _{i=1}^{s(\textbf{m})-1}m_i)\) of them will be activated and the rest will not be activated.

This implies that the expected number of arms in state j at time \(t+1\) will be equal to

$$\begin{aligned}&\sum _{i=1}^{s(\textbf{m})-1} N m_i P^1_{ij} \nonumber \\&+ N(\alpha -\sum _{i=1}^{s(\textbf{m})-1}m_i)P^1_{s(\textbf{m})j} + N(\sum _{i=1}^{s(\textbf{m})}m_i - \alpha )P^0_{s(\textbf{m})j} + \sum _{i=s(\textbf{m})+1}^{d} N m_i P^0_{ij}. \end{aligned}$$
(20)

This justifies expression (8). Note that (8) can be reorganized as

$$\begin{aligned} \phi _j(\textbf{m}) = \sum _{i=1}^{s(\textbf{m})-1} m_i (P^1_{ij}-P^1_{s(\textbf{m})j}+P^0_{s(\textbf{m})j}) + \sum _{i=s(\textbf{m})}^{d} m_i P^0_{ij} + \alpha (P^1_{s(\textbf{m})j}-P^0_{s(\textbf{m})j}). \end{aligned}$$

Consequently \(\phi (\textbf{m}) = \textbf{m}\cdot \textbf{K}_{s(\textbf{m})} + \textbf{b}_{s(\textbf{m})}\), where \(\textbf{b}_{s(\textbf{m})} = \alpha (\textbf{P}^1_{s(\textbf{m})} - \textbf{P}^{0}_{s(\textbf{m})})\), and \(\textbf{K}_{s(\textbf{m})} = \) \( \begin{pmatrix} \textbf{P}^1_1 - \textbf{P}^1_{s(\textbf{m})} + \textbf{P}_{s(\textbf{m})}^0 \\ \textbf{P}^1_2 - \textbf{P}^1_{s(\textbf{m})} + \textbf{P}_{s(\textbf{m})}^0 \\ \vdots \\ \textbf{P}^1_{s(\textbf{m})-1} - \textbf{P}^1_{s(\textbf{m})} + \textbf{P}^0_{s(\textbf{m})} \\ \textbf{P}^0_{s(\textbf{m})} \\ \textbf{P}^0_{s(\textbf{m})+1} \\ \vdots \\ \textbf{P}^0_{d} \end{pmatrix} \).

Let \(\mathcal {Z}_i:= \{\textbf{m}\in \Delta ^d \mid s(\textbf{m})=i\}\). The above expression of \(\phi \) implies that this map is affine on each zone \(\mathcal {Z}_i\), and there are d such zones, \(1 \le i \le d\). Continuity of \(\phi \) in \(\textbf{m}\) follows because on the common boundary of two adjacent zones the two affine expressions coincide: there, the partially activated state is either fully activated or carries no activated mass. \(\square \)
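The affine pieces derived above translate directly into code. The following sketch (our illustration, not the paper's repository code) implements the map \(\phi \) of Eq. (8), assuming \(0 \le \alpha < 1\) and states already sorted by decreasing Whittle index:

```python
import numpy as np

def phi(m, P0, P1, alpha):
    """One step of the deterministic WIP dynamics, Eq. (8): states
    1..s(m)-1 fully active, state s(m) partially active, the rest passive.
    Sketch code; assumes 0 <= alpha < 1 and index-ordered states."""
    m = np.asarray(m, dtype=float)
    P0, P1 = np.asarray(P0, dtype=float), np.asarray(P1, dtype=float)
    cum = np.cumsum(m)
    s = int(np.searchsorted(cum, alpha, side="right"))  # 0-based s(m)
    active = m.copy()
    active[s + 1:] = 0.0
    active[s] = alpha - (cum[s - 1] if s > 0 else 0.0)  # partial activation
    passive = m - active
    return active @ P1 + passive @ P0

# one step from m = (0.6, 0.4) with alpha = 0.5 (made-up matrices)
m_next = phi([0.6, 0.4],
             [[0.5, 0.5], [0.2, 0.8]],
             [[0.9, 0.1], [0.3, 0.7]],
             0.5)
# m_next = (0.58, 0.42), again a point of the simplex
```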

Lemma 7

(Bijectivity) Let \(\pi (s,\theta ) \in \Pi \) be the policy that activates all arms in states \(1,\dots ,s-1\), does not activate arms in states \(s+1, s+2, \dots , d\), and that activates arms in state s with probability \(\theta \). Denote by \(\tilde{\alpha }(s,\theta )\) the proportion of time that the active action is taken using policy \(\pi (s,\theta )\). Then, the function \((s,\theta )\mapsto \tilde{\alpha }(s,\theta )\) is a bijective map from \(\{1 \dots d \} \times [0,1)\) to [0, 1).

Proof

The following proof is partially adapted from the proof of Lemma 1 of Weber and Weiss [31]. For a given \(\nu \in \mathbb {R}\), denote by \(\gamma (\nu )\) the value of the subsidy-\(\nu \) problem, i.e.,

$$\begin{aligned} \gamma (\nu ) := \max _{\pi \in \Pi } \lim _{T \rightarrow \infty } \frac{1}{T} \mathbb {E} \Big [ \sum _{t=0}^{T-1} \Big ( R^{\pi (S(t))}_{S(t)}+\nu \big (1-\pi (S(t))\big ) \Big ) \Big ]. \end{aligned}$$
(21)

We define similarly \(\gamma _{\pi } (\nu )\) as the value under policy \(\pi \) of such a subsidy-\(\nu \) problem. Note that for fixed \(\pi \), the function \(\gamma _{\pi } (\nu )\) is affine and nondecreasing in \(\nu \).

By definition of indexability, \(\gamma (\nu ) = \max _{\pi \in \Pi } \gamma _{\pi } (\nu )\) is a piecewise affine, continuous and convex function of \(\nu \): it is affine on \((-\infty ;\nu _d]\), on \([\nu _1;+\infty )\) and on all \([\nu _s;\nu _{s-1}]\) for \(s\in \{2\dots d\}\).

Moreover, for \(s\in \{2\dots d-1\}\) and \(\nu \in [\nu _s;\nu _{s-1}]\), the optimal policy of (21) is to activate all arms up to state \(s-1\). Hence,

$$\begin{aligned} \gamma (\nu )&= \gamma _{\pi (s,0)} (\nu ) = \gamma (\nu _{s-1}) + \big (1-\tilde{\alpha }(s,0)\big ) \cdot (\nu - \nu _{s-1}). \end{aligned}$$

Similarly, and as \(\tilde{\alpha }(s+1,0)=\tilde{\alpha }(s,1)\), for \(\nu \in [\nu _{s+1};\nu _{s}]\) we have:

$$\begin{aligned} \gamma (\nu )&= \gamma (\nu _{s}) + \big (1-\tilde{\alpha }(s+1,0)\big ) \cdot (\nu - \nu _{s}) \\&= \gamma (\nu _{s}) + \big (1-\tilde{\alpha }(s,1)\big )\cdot (\nu - \nu _{s}). \end{aligned}$$

Consequently

$$\begin{aligned} \frac{\partial \gamma }{\partial \nu } (\nu ) = {\left\{ \begin{array}{ll} 1 - \tilde{\alpha }(s,0), &{} \text{ if } \nu _s< \nu< \nu _{s-1} \\ 1 - \tilde{\alpha }(s,1), &{} \text{ if } \nu _{s+1}< \nu < \nu _{s}. \end{array}\right. } \end{aligned}$$

The convexity of \(\gamma (\nu )\) implies that \(1-\tilde{\alpha }(s,0) > 1-\tilde{\alpha }(s,1)\), hence \(\tilde{\alpha }(s,1) > \tilde{\alpha }(s,0)\).

Now suppose that \(\textbf{m}^0\) and \(\textbf{m}^1\) are the equilibrium distributions of policies \(\pi (s,0)\) and \(\pi (s,1)\). Let \(0< \theta < 1\). The equilibrium distribution \(\textbf{m}^{\theta }\) induced by \(\pi (s,\theta )\) is then a convex combination of \(\textbf{m}^0\) and \(\textbf{m}^1\), namely \(\textbf{m}^{\theta } = p\cdot \textbf{m}^0 + (1-p)\cdot \textbf{m}^1\), with

$$\begin{aligned} p = \frac{(1- \theta )m_s^{1}}{\theta m_s^0 + (1-\theta )m_s^1}. \end{aligned}$$

Hence,

$$\begin{aligned} m_s^{\theta }&= p m^0_s + (1-p)m^1_s \\&= \frac{m^1_s m^0_s}{\theta m^0_s + (1-\theta )m^1_s}, \end{aligned}$$

and

$$\begin{aligned} \tilde{\alpha }(s,\theta )&= \bigg ( \sum _{k=1}^{s-1}m_k^{\theta } \bigg ) + \theta m_s^{\theta } \\&= \ \sum _{k=1}^{s-1} \big ( (1-p) m^1_k + p m^0_k \big ) + \frac{\theta \cdot m^1_s m^0_s}{\theta m^0_s + (1-\theta )m^1_s} \\&= \ \frac{\sum _{k=1}^{s-1} \big ( \theta \cdot m_s^0 m_k^1 + (1-\theta ) m_s^1 m_k^0 \big ) \ + \theta \cdot m^1_s m^0_s}{\theta m^0_s + (1-\theta )m^1_s}. \end{aligned}$$

Observe that \(\tilde{\alpha }(s,\theta )\) is the ratio of two affine functions of \(\theta \), hence monotone as \(\theta \) ranges from 0 to 1; since \(\tilde{\alpha }(s,1) > \tilde{\alpha }(s,0)\), it is increasing. We hence obtain

$$\begin{aligned} 1 = \tilde{\alpha }(d,1)> \tilde{\alpha }(d,0) = \tilde{\alpha }(d-1,1)> \dots> \tilde{\alpha }(2,0) = \tilde{\alpha }(1,1) > \tilde{\alpha }(1,0) = 0, \end{aligned}$$

which concludes the proof. \(\square \)
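The identity \(\tilde{\alpha }(s,1)=\tilde{\alpha }(s+1,0)\) used in the chain above can be checked numerically. The sketch below (our illustration, with made-up ergodic matrices; states assumed index-ordered) computes \(\tilde{\alpha }(s,\theta )\) from the stationary distribution of \(\pi (s,\theta )\):

```python
import numpy as np

def alpha_tilde(s, theta, P0, P1):
    """Long-run activation rate of pi(s, theta): states 1..s-1 active,
    state s active with probability theta, states s+1..d passive.
    s is 1-based as in the paper; sketch code, assumes an ergodic chain."""
    P0a, P1a = np.asarray(P0, float), np.asarray(P1, float)
    P = P0a.copy()
    P[: s - 1] = P1a[: s - 1]                       # fully active states
    P[s - 1] = theta * P1a[s - 1] + (1 - theta) * P0a[s - 1]
    w, v = np.linalg.eig(P.T)                       # stationary distribution =
    pi = np.real(v[:, np.argmin(np.abs(w - 1))])    # left eigenvector for 1
    pi = pi / pi.sum()
    return float(pi[: s - 1].sum() + theta * pi[s - 1])

# made-up 3-state example
P0 = [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6]]
P1 = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]]
a21 = alpha_tilde(2, 1.0, P0, P1)   # pi(2,1) and pi(3,0) are the same policy,
a30 = alpha_tilde(3, 0.0, P0, P1)   # so these two rates coincide
```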

We are now ready to finish the proof of Lemma 2(iii). Let \(\textbf{m}\) be a fixed point of the continuous map \(\phi \) (which exists by Brouwer's fixed-point theorem). Under configuration \(\textbf{m}\), all arms that are in states 1 to \(s(\textbf{m})-1\) are activated, and a fraction \(\theta (\textbf{m})=(\alpha -\sum _{i=1}^{s(\textbf{m})-1} m_i)/m_{s(\textbf{m})}\) of the arms that are in state \(s(\textbf{m})\) are activated. This shows that \(\textbf{m}\) is also the stationary distribution of the policy \(\pi (s(\textbf{m}),\theta (\textbf{m}))\), whose proportion of activated arms is \(\tilde{\alpha }(s(\textbf{m}),\theta (\textbf{m}))=\alpha \). Consequently, if \(\textbf{m}'\) were another fixed point of \(\phi \), then \(\textbf{m}'\) would have to be the stationary distribution of some other policy of the form \(\pi (s',\theta ')\), with \(\tilde{\alpha }(s',\theta ') = \alpha \). As the function \((s,\theta )\mapsto \tilde{\alpha }(s,\theta )\) is a bijection, this implies that \(s' = s(\textbf{m})\) and \(\theta '=\theta (\textbf{m})\). Hence, the fixed point of \(\phi \) is unique.
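Since \(\phi \) is affine on each zone, the fixed point can also be computed zone by zone: solve the linear system \(\textbf{m} = \textbf{m}\textbf{K}_s + \textbf{b}_s\) together with \(\sum _i m_i = 1\), and keep the solution that lies in its own zone \(\mathcal {Z}_s\). A sketch (our illustration, assuming index-ordered states; \(\textbf{K}_s\), \(\textbf{b}_s\) as in Lemma 6):

```python
import numpy as np

def wip_fixed_point(P0, P1, alpha):
    """Fixed point m* of the piecewise affine map phi: for each candidate
    zone s, solve m(I - K_s) = b_s with sum(m) = 1 and return the solution
    consistent with its zone. Sketch code, 0-based zone index."""
    P0, P1 = np.asarray(P0, float), np.asarray(P1, float)
    d = len(P0)
    for s in range(d):
        K = P0.copy()
        K[:s] = P1[:s] - P1[s] + P0[s]          # rows 1..s-1 of K_s
        b = alpha * (P1[s] - P0[s])             # b_s from Lemma 6
        # stack m(I - K) = b with the normalization m . 1 = 1
        A = np.vstack([(np.eye(d) - K).T, np.ones(d)])
        m, *_ = np.linalg.lstsq(A, np.append(b, 1.0), rcond=None)
        cum = np.cumsum(m)
        prev = cum[s - 1] if s > 0 else 0.0
        if np.all(m >= -1e-9) and prev <= alpha < cum[s] + 1e-9:
            return m
    return None

# made-up 2-state example with alpha = 0.4
m_star = wip_fixed_point([[0.9, 0.1], [0.4, 0.6]],
                         [[0.6, 0.4], [0.1, 0.9]], 0.4)
# m_star = (0.56, 0.44), lying in the first zone
```

By the uniqueness just proved, at most one zone yields an admissible solution (up to zone-boundary ties), so the loop is unambiguous.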

C Proof of Theorem 3

In this appendix, we give the technical details of the proof of our main result, Theorem 3. In the following, we denote by \(\mathcal {B}(\textbf{m}^*, r)\) the ball centered at \(\textbf{m}^*\) with radius r.

Theorem 8

Assume the same conditions as in Theorem 3, and assume moreover that \(\textbf{M}^{(N)}(0)\) is distributed according to the stationary regime. Then there exist two constants \(b,c >0\) such that

  (i) \(\Vert \mathbb {E}[\textbf{M}^{(N)} (0)] - \textbf{m}^* \Vert \le b \cdot e^{-cN}\);

  (ii) \(\mathbb {P}\left[ \textbf{M}^{(N)}(0)\not \in \mathcal {Z}_{s(\textbf{m}^*)}\right] \le b \cdot e^{-cN}\).

Let us first explain how Theorem 8 implies Theorem 3. To do so, we first prove the following lemma.

Lemma 9

Assume that bandits are indexable, and let \(\rho (\textbf{m})\) be the instantaneous arm-averaged reward of WIP when the system is in configuration \(\textbf{m}\). Then:

  (i) \(\rho \) is affine on each zone \(\mathcal {Z}_i\), and for all \(\textbf{m}\in \Delta ^d\):

    $$\begin{aligned} \rho (\textbf{m})=&\sum _{i=1}^{s(\textbf{m})-1} m_i R^1_{i} + \left( \alpha -\sum _{i=1}^{s(\textbf{m})-1}m_i\right) R^1_{s(\textbf{m})} + \left( \sum _{i=1}^{s(\textbf{m})}m_i - \alpha \right) R^0_{s(\textbf{m})}\nonumber \\&+ \sum _{i=s(\textbf{m})+1}^{d} m_i R^0_{i}. \end{aligned}$$
    (22)

  (ii) \(\rho (\textbf{m}^*)=V^{(1)}_{\textrm{rel}}(\alpha )\).

Proof

Let \(\textbf{m}\in \Delta ^d\) be a configuration and recall that \(s(\textbf{m})\in \{1, \dots , d\}\) is the state such that \(\sum _{i=1}^{s(\textbf{m})-1}m_i\le \alpha < \sum _{i=1}^{s(\textbf{m})}m_i\). As in the analysis of Lemma 6, when the system is in configuration \(\textbf{m}\), WIP activates all arms that are in states 1 to \(s(\textbf{m})-1\), which yields an instantaneous reward of \(\sum _{i=1}^{s(\textbf{m})-1}Nm_iR^1_i\). WIP does not activate arms that are in states \(s(\textbf{m})+1\) to d, which yields an instantaneous reward of \(\sum _{i=s(\textbf{m})+1}^{d}Nm_iR^0_i\). Among the \(Nm_{s(\textbf{m})}\) arms in state \(s(\textbf{m})\), \(N(\alpha -\sum _{i=1}^{s(\textbf{m})-1}m_i)\) are activated and the rest are not. This shows that \(\rho (\textbf{m})\) is given by (22).

For (ii), recall that \(\textbf{m}^*\) is the unique fixed point, and consider a subsidy-\(\nu _{s(\textbf{m}^*)}\) MDP, where \(\nu _{s(\textbf{m}^*)}\) is the Whittle index of state \(s(\textbf{m}^*)\). Denote by L the value of this MDP:

$$\begin{aligned} L&:= \max _{\Pi } \lim _{T\rightarrow \infty }\frac{1}{T} \sum _{t=0}^{T-1} \mathbb {E}\left[ R^{a_n(t)}_{S_n(t)} + (\alpha -a_n(t))\nu _{s(\textbf{m}^*)}\right] \nonumber \\&= \max _{\Pi } \lim _{T\rightarrow \infty }\frac{1}{T} \sum _{t=0}^{T-1} \mathbb {E}\left[ R^{a_n(t)}_{S_n(t)}\right] + \left( \alpha -\lim _{T\rightarrow \infty }\frac{1}{T} \sum _{t=0}^{T-1}\mathbb {E}\left[ a_n(t)\right] \right) \nu _{s(\textbf{m}^*)}. \end{aligned}$$
(23)

By definition of the Whittle index, any policy of the form \(\pi (s(\textbf{m}^*),\theta )\) defined in Lemma 7 is optimal for (23). Moreover, if \(\theta ^*\) is such that \(\tilde{\alpha }(s(\textbf{m}^*),\theta ^*)=\alpha \), then such a policy satisfies the constraint (5): \(\lim _{T\rightarrow \infty }\frac{1}{T} \sum _{t=0}^{T-1}\mathbb {E}\left[ a_n(t)\right] = \alpha \). This shows that \(L=V^{(1)}_{\textrm{rel}}(\alpha )\); as all arms are identical, we have \(N \cdot V^{(1)}_{\textrm{rel}}(\alpha )=V^{(N)}_{\textrm{rel}}(\alpha )\), and \(\pi (s(\textbf{m}^*),\theta ^*)\) is an optimal policy under the relaxed constraint (5).

It remains to show that the reward of policy \(\pi (s(\textbf{m}^*),\theta ^*)\) is \(\rho (\textbf{m}^*)\). This comes from the fact that the steady-state of the Markov chain induced by this policy is \(\textbf{m}^*\), and \(\pi (s(\textbf{m}^*), \theta ^*)\) is such that \(\alpha N\) arms are activated on average. Indeed, the arm-averaged reward of this policy is:

$$\begin{aligned} L = \sum _{i=1}^{s(\textbf{m}^*)-1} m^*_i R^1_i + \theta ^* m^*_{s(\textbf{m}^*)}R^1_{s(\textbf{m}^*)} + (1-\theta ^*) m^*_{s(\textbf{m}^*)}R^0_{s(\textbf{m}^*)} + \sum _{i=s(\textbf{m}^*)+1}^d m^*_i R^0_i \end{aligned}$$
(24)

As the proportion of activated arms is \(\alpha \), we have \(\sum _{i=1}^{s(\textbf{m}^*)-1} m^*_i + \theta ^* m^*_{s(\textbf{m}^*)}=\alpha \). Hence, (24) coincides with the expression of \(\rho (\textbf{m}^*)\) in (22), and \(\rho (\textbf{m}^*)= L = V^{(1)}_{\textrm{rel}}(\alpha )\). This concludes the proof of Lemma 9. \(\square \)

By definition, the performance of WIP is \(V^{(N)}_{\textrm{WIP}} (\alpha )= N \cdot \mathbb {E}\left[ \rho (\textbf{M}^{(N)}(0))\right] \). Hence from Lemma 9 we have

$$\begin{aligned} V^{(N)}_{\textrm{rel}}(\alpha )- V^{(N)}_{\textrm{WIP}} (\alpha )&= N \cdot V^{(1)}_{\textrm{rel}}(\alpha ) - N \cdot \mathbb {E}\left[ \rho (\textbf{M}^{(N)}(0))\right] \\&= N \cdot \mathbb {E}\bigg [ \big ( \rho (\textbf{m}^*) - \rho (\textbf{M}^{(N)}(0)) \big ) \mathbbm {1}_{\{ \textbf{M}^{(N)}(0)\in \mathcal {Z}_{s(\textbf{m}^*)}\} } \\&\qquad +\big ( \rho (\textbf{m}^*) - \rho (\textbf{M}^{(N)}(0)) \big ) \mathbbm {1}_{\{ \textbf{M}^{(N)}(0)\not \in \mathcal {Z}_{s(\textbf{m}^*)}\} } \bigg ] \end{aligned}$$

By linearity of \(\rho \) and Theorem 8(i), the first term inside the above expectation is exponentially small; by Theorem 8(ii) and since the rewards are bounded, the second term is also exponentially small.

In the rest of this section, we first prove a few technical lemmas and then conclude with the proof of Theorem 8.

1.1 C.1 Hoeffding’s inequality (for one transition)

Lemma 10

(Hoeffding’s inequality) For all \(t \in \mathbb {N}\), we have

$$\begin{aligned} \textbf{M}^{(N)}(t+1) = \phi \big ( \textbf{M}^{(N)}(t) \big ) + {\varvec{\epsilon }}^{(N)}(t+1) \end{aligned}$$

where the random vector \({\varvec{\epsilon }}^{(N)} (t+1)\) is such that

$$\begin{aligned} \mathbb {E} [ {\varvec{\epsilon }}^{(N)} (t+1) \big | \textbf{M}^{(N)}(t) ] = \textbf{0}, \end{aligned}$$

and for all \(\delta >0\):

$$\begin{aligned} \mathbb {P}\left[ \Vert {\varvec{\epsilon }}^{(N)}(t+1) \Vert \ge \delta \right] \le e^{-2N \delta ^2}. \end{aligned}$$

Proof

Since the N arms evolve independently, we may apply the following form of Hoeffding’s inequality: let \(X_1\), \(X_2\),..., \(X_N\) be N independent random variables taking values in the interval [0, 1], and define their empirical mean by \(\overline{X}:= \frac{1}{N} (X_1 + X_2 +\cdots + X_N)\); then

$$\begin{aligned} \mathbb {P}\left[ \overline{X} - \mathbb {E}[\overline{X}] \ge \delta \right] \le e^{-2N\delta ^2}. \end{aligned}$$

More precisely, for a fixed \(1 \le j \le d\), we have

$$\begin{aligned} M^{(N)}_j(t+1) = \frac{1}{N} \sum _{i=1}^{d} \sum _{k=1}^{N \cdot M^{(N)}_i(t) } \mathbbm {1}_{\{ U_{i,k} \le P_{ij}(\textbf{M}^{(N)}(t)) \}} \end{aligned}$$

where for \(1 \le i \le d\), \(\ 1 \le k \le N \cdot M^{(N)}_i(t)\), the \(U_{i,k}\)’s are N independent and identically distributed uniform (0, 1) random variables in total, and \(P_{ij}(\textbf{m})\) is the probability that an arm in state i moves to state j under WIP when the N-arm system is in configuration \(\textbf{m}\).

By definition, we have

$$\begin{aligned} \phi _j (\textbf{M}^{(N)}(t)) = \sum _{i=1}^{d} M^{(N)}_i(t) \cdot P_{ij}(\textbf{M}^{(N)} (t)). \end{aligned}$$

Hence,

$$\begin{aligned} \mathbb {E}\big [ M^{(N)}_j(t+1) \big | \textbf{M}^{(N)}(t) \big ] = \sum _{i=1}^{d} \frac{1}{N} \cdot N \cdot M^{(N)}_i(t) \cdot P_{ij} (\textbf{M}^{(N)} (t)) = \phi _j (\textbf{M}^{(N)}(t)), \end{aligned}$$

and

$$\begin{aligned}&\mathbb {P}\left[ \Vert \textbf{M}^{(N)}(t+1) - \phi (\textbf{M}^{(N)} (t)) \Vert \ge \delta \right] \\&\quad = \mathbb {P}\left[ \max _{1 \le j \le d} \big | M^{(N)}_j(t+1) - \phi _j (\textbf{M}^{(N)}(t)) \big | \ge \delta \right] \\&\le e^{-2N\delta ^2}, \end{aligned}$$

where the last inequality comes from the above form of Hoeffding’s inequality. \(\square \)
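As a sanity check, the bound \(e^{-2N\delta ^2}\) of Lemma 10 can be compared against a Monte Carlo estimate of the one-sided deviation probability. A minimal sketch, assuming Bernoulli(1/2) variables and purely illustrative values of N and \(\delta \):

```python
import math
import random

random.seed(0)
N, delta, trials = 100, 0.15, 20000
bound = math.exp(-2 * N * delta ** 2)  # Hoeffding bound e^{-2 N delta^2}

# Empirical frequency of the event { Xbar - E[Xbar] >= delta }
# for X_i ~ Bernoulli(1/2), so E[Xbar] = 0.5.
exceed = sum(
    sum(random.random() < 0.5 for _ in range(N)) / N - 0.5 >= delta
    for _ in range(trials)
) / trials
```

The empirical frequency stays below the bound, as the inequality predicts; for Bernoulli variables the true tail is in fact well below \(e^{-2N\delta ^2}\).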

1.2 C.2 Hoeffding’s inequality (for t transitions)

Lemma 11

There exists a positive constant K such that for all \(t \in \mathbb {N}\) and for all \(\delta > 0\),

$$\begin{aligned} \mathbb {P}\left[ \Vert \textbf{M}^{(N)}(t+1) - \Phi _{t+1}(\textbf{m}) \Vert \ge (1 {+} K {+} \cdots {+} K^t)\delta \ \Big | \ \textbf{M}^{(N)}(0) = \textbf{m}\right] \le (t+1)e^{-2N\delta ^2}. \end{aligned}$$

Proof

Since \(\phi \) is a piecewise affine continuous function with finitely many affine pieces, it is in particular K-Lipschitz: there is a constant \(K > 0\) such that for all \(\textbf{m}_1, \textbf{m}_2 \in \Delta ^d\):

$$\begin{aligned} \Vert \phi (\textbf{m}_1) - \phi (\textbf{m}_2) \Vert \le K \cdot \Vert \textbf{m}_1 - \textbf{m}_2 \Vert . \end{aligned}$$

Let \(t \in \mathbb {N}\) and \(\textbf{m}\in \Delta ^d\) be fixed; we have

$$\begin{aligned}&\Vert \textbf{M}^{(N)} (t{+}1) - \Phi _{t{+}1} (\textbf{m}) \Vert \\&\quad \le \Vert \textbf{M}^{(N)} (t{+}1) - \phi (\textbf{M}^{(N)}(t)) \Vert + \Vert \phi (\textbf{M}^{(N)} (t)) - \phi (\Phi _t (\textbf{m})) \Vert \\&\quad \le \Vert \textbf{M}^{(N)} (t{+}1) - \phi (\textbf{M}^{(N)}(t)) \Vert + K \cdot \Vert \textbf{M}^{(N)} (t) - \Phi _{t} (\textbf{m}) \Vert . \end{aligned}$$

By iterating the above inequality, we obtain

$$\begin{aligned}&\Vert \textbf{M}^{(N)} (t+1) - \Phi _{t+1} (\textbf{m}) \Vert \\&\quad \le \Vert \textbf{M}^{(N)} (t+1) - \phi (\textbf{M}^{(N)}(t)) \Vert + K \cdot \Vert \textbf{M}^{(N)} (t) - \phi (\textbf{M}^{(N)}(t-1)) \Vert \\&\qquad + K^2 \cdot \Vert \textbf{M}^{(N)} (t-1) - \Phi _{t-1}(\textbf{m}) \Vert \\&\quad \le \sum _{s=0}^{t} K^s \cdot \Vert \textbf{M}^{(N)} (t+1-s) - \phi (\textbf{M}^{(N)}(t-s)) \Vert , \end{aligned}$$

where for each \(0 \le s \le t\), we have by Lemma 10: for all \(\delta > 0\),

$$\begin{aligned} \mathbb {P}\left[ \Vert \textbf{M}^{(N)} (t+1-s) - \phi (\textbf{M}^{(N)}(t-s)) \Vert \ge \delta \right] \le e^{-2N \delta ^2}. \end{aligned}$$

Hence, using the union bound, we obtain

$$\begin{aligned}&\mathbb {P}\left[ \Vert \mathbf {\textbf{M}}^{(N)}(t+1) - \Phi _{t+1}(\textbf{m}) \Vert \ge (1 + K + K^2 + \cdots + K^t)\delta \ \Big | \ \textbf{M}^{(N)}(0) = \textbf{m}\right] \\&\quad \le \mathbb {P}\left[ \sum _{s=0}^{t} K^s \cdot \Vert \textbf{M}^{(N)} (t+1-s) - \phi (\textbf{M}^{(N)}(t-s)) \Vert \ge (1 + K + K^2 + \cdots + K^t)\delta \right] \\&\quad \le \mathbb {P}\left[ \bigcup _{s=0}^{t} \big \{ \Vert \textbf{M}^{(N)} (t+1-s) - \phi (\textbf{M}^{(N)}(t-s)) \Vert \ge \delta \big \} \right] \\&\quad \le \sum _{s=0}^{t} \mathbb {P}\left[ \Vert \textbf{M}^{(N)} (t+1-s) - \phi (\textbf{M}^{(N)}(t-s)) \Vert \ge \delta \right] \\&\quad \le (t+1)\cdot e^{-2N \delta ^2}, \end{aligned}$$

and this ends the proof of Lemma 11. \(\square \)

1.3 C.3 Exponential stability of \(\textbf{m}^*\)

Lemma 12

Under the assumptions of Theorem 3, there exist constants \(b_1,b_2>0\) such that for all \(t\ge 0\) and all \(\textbf{m}\in \Delta ^d\):

$$\begin{aligned} \Vert \Phi _t (\textbf{m}) - \textbf{m}^* \Vert \le b_1 \cdot e^{-b_2 t} \cdot \Vert \textbf{m}- \textbf{m}^* \Vert . \end{aligned}$$
(25)

Proof

As \(\textbf{m}^*\) is a locally stable fixed point of \(\phi \), for all \(\varepsilon >0\) there exists \(\delta >0\) such that if \(\Vert \textbf{m}-\textbf{m}^*\Vert \le \delta \), then \(\Vert \Phi _t(\textbf{m})-\textbf{m}^*\Vert \le \varepsilon \) for all \(t \ge 0\). Recall that for all \(\textbf{m}\in \mathcal {Z}_{s(\textbf{m}^*)}\), we have \(\phi (\textbf{m})=(\textbf{m}-\textbf{m}^*) \cdot \textbf{K}_{s(\textbf{m}^*)}+\textbf{m}^*\). We choose \(\varepsilon >0\) so that \(\mathcal {B}(\textbf{m}^*,\varepsilon )\subset \mathcal {Z}_{s(\textbf{m}^*)}\).

Let us now show that there exists \(T>0\) such that for all \(\textbf{m}\in \Delta ^d\), \(\Phi _T(\textbf{m})\in \mathcal {B}(\textbf{m}^*,\varepsilon )\). We reason by contradiction: if this were not true, there would exist a sequence of times \(t \in \mathbb {N}\) going to infinity and corresponding configurations \(\{\textbf{m}_t\}_{t}\) such that \(\Vert \Phi _t(\textbf{m}_t)-\textbf{m}^*\Vert \ge \varepsilon \). As \(\Delta ^d\) is compact, there exists a subsequence of \(\{\textbf{m}_t\}_{t}\) (still denoted \(\{\textbf{m}_t\}_{t}\)) that converges to some \(\bar{\textbf{m}}\). On the other hand, as \(\textbf{m}^*\) is an attractor, there exists \(T_1\) such that \(\Phi _{T_1}(\bar{\textbf{m}})\in \mathcal {B}(\textbf{m}^*,\delta /2)\). Since \(\Phi _{T_1}(\cdot )\) is continuous, there exists \(\eta >0\) such that \(\Vert \textbf{m}-\bar{\textbf{m}}\Vert \le \eta \) implies \(\Vert \Phi _{T_1}(\textbf{m})-\Phi _{T_1}(\bar{\textbf{m}})\Vert \le \delta /2\). As \(\{\textbf{m}_t\}_{t}\) converges to \(\bar{\textbf{m}}\), there exists \(T_2\) such that \(\Vert \textbf{m}_t - \bar{\textbf{m}}\Vert \le \eta \) for all \(t\ge T_2\). Consequently, for \(t \ge T_2\), we have

$$\begin{aligned} \Vert \Phi _{T_1}(\textbf{m}_t)-\textbf{m}^*\Vert \le \Vert \Phi _{T_1}(\textbf{m}_t) - \Phi _{T_1}(\bar{\textbf{m}})\Vert + \Vert \Phi _{T_1}(\bar{\textbf{m}}) - \textbf{m}^*\Vert \le \delta . \end{aligned}$$

Hence, for \(t \ge \max (T_1,T_2)\), by our choice of \(\varepsilon \) and \(\delta \) from the local stability of \(\phi \), we deduce that

$$\begin{aligned} \Vert \Phi _t(\textbf{m}_t) - \textbf{m}^*\Vert = \Vert \Phi _{t-T_1}(\Phi _{T_1}(\textbf{m}_t)) - \textbf{m}^*\Vert \le \varepsilon . \end{aligned}$$

This is a contradiction. Consequently, there exists T such that for all \(\textbf{m}\in \Delta ^d\), \(\Phi _T(\textbf{m})\in \mathcal {B}(\textbf{m}^*,\varepsilon )\). This implies in particular that \(\textbf{K}_{s(\textbf{m}^*)}\) is a stable matrix: the moduli of all its eigenvalues are smaller than one. Moreover, we have for all \(\textbf{m}\in \Delta ^d\) and \(t\ge T\):

$$\begin{aligned} \Phi _t(\textbf{m})= \big (\Phi _T(\textbf{m}) - \textbf{m}^*\big ) \cdot \textbf{K}_{s(\textbf{m}^*)}^{t-T}+\textbf{m}^*. \end{aligned}$$

As \(\textbf{K}_{s(\textbf{m}^*)}\) is a stable matrix, this implies that (25) holds for all \(\textbf{m}\in \Delta ^d\). \(\square \)

1.4 C.4 Proof of Theorem 8

We are now ready to prove the main theorem.

Proof

The proof consists of several parts.

1.4.1 C.4.1 Choice of a neighborhood \(\mathcal {N}\)

The fixed point \(\textbf{m}^*\) is in zone \(\mathcal {Z}_{s(\textbf{m}^*)}\) in which \(\phi \) can be written as

$$\begin{aligned} \phi (\textbf{m}) = (\textbf{m}- \textbf{m}^*) \cdot \textbf{K}_{s(\textbf{m}^*)}+ \textbf{m}^*. \end{aligned}$$

As \(\textbf{m}^*\) is not singular, there is a neighborhood \(\mathcal {N}_1\) of \(\textbf{m}^*\) included in \(\mathcal {Z}_{s(\textbf{m}^*)}\). Since \(\textbf{m}^*\) is locally stable, \(\textbf{K}_{s(\textbf{m}^*)}\) is a stable matrix. We can therefore choose a smaller neighborhood \(\mathcal {N}_2 \subset \mathcal {N}_1\) such that \(\Phi _t (\mathcal {N}_2) \subset \mathcal {N}_1\) for all \(t\ge 0\); that is, the image of \(\mathcal {N}_2\) under every map \(\Phi _t\) remains inside \(\mathcal {N}_1\). This is possible by the stability of \(\textbf{m}^*\). We next choose a neighborhood \(\mathcal {N}_3 \subset \mathcal {N}_2\) and a \(\delta > 0\) such that \((\phi (\mathcal {N}_3))^{\delta } \subset \mathcal {N}_2\); that is, the image of \(\mathcal {N}_3\) under \(\phi \) remains inside \(\mathcal {N}_2\), at distance at least \(\delta \) from the boundary of \(\mathcal {N}_2\). We finally fix \(r > 0\) such that \(\mathcal {B}(\textbf{m}^*, r) \cap \Delta ^d \subset \mathcal {N}_3\), and we choose our neighborhood \(\mathcal {N}\) as

$$\begin{aligned} \mathcal {N}:= \mathcal {B}(\textbf{m}^*, r) \cap \Delta ^d. \end{aligned}$$

Note that the choice of r and \(\delta \) is independent of N. Following the proof of Lemma 12, we furthermore denote by \(\tilde{T}:= T(r/2)\) the finite time such that for all \(\textbf{m}\in \Delta ^d\), \(\Phi _{\tilde{T}+1} (\textbf{m}) \in \mathcal {B}(\textbf{m}^*, r/2)\).

1.4.2 C.4.2 Definition and properties of the function G

Following the generator approach used, for instance, in Gast et al [12], define \(G: \Delta ^d \rightarrow \mathbb {R}^d\) for \(\textbf{m}\in \Delta ^d\) as

$$\begin{aligned} G(\textbf{m}) := \sum _{t=0}^{\infty } \big ( \Phi _t (\textbf{m}) - \textbf{m}^*\big ). \end{aligned}$$

By Lemma 12, for all \(\textbf{m}\in \Delta ^d\) we have \(\Vert G(\textbf{m}) \Vert \le \sum _{t=0}^{\infty } b_1 \cdot e^{-b_2t} \cdot \Vert \textbf{m}- \textbf{m}^* \Vert < \infty \). This shows that the function G is well defined and bounded. Denote \( \overline{G}:= \sup _{\textbf{m}\in \Delta ^d} \Vert G(\textbf{m}) \Vert <\infty \).

By our choice of \(\mathcal {N}_2\) defined above, for all \(t\ge 0\) and \(\textbf{m}\in \mathcal {N}_2\) we have:

$$\begin{aligned} \Phi _t (\textbf{m}) = (\textbf{m}- \textbf{m}^*) \cdot \textbf{K}_{s(\textbf{m}^*)}^t + \textbf{m}^*. \end{aligned}$$
(26)

Hence, for all \(\textbf{m}\in \mathcal {N}_2\), we have

$$\begin{aligned} G(\textbf{m}) =&\ \sum _{t=0}^{\infty } \big (\Phi _t (\textbf{m}) - \textbf{m}^*\big ) \\ =&\ \sum _{t=0}^{\infty } ( \textbf{m}- \textbf{m}^*) \cdot \textbf{K}_{s(\textbf{m}^*)}^t \\ =&\ (\textbf{m}- \textbf{m}^*) \cdot (\textbf{I} - \textbf{K}_{s(\textbf{m}^*)})^{-1}, \end{aligned}$$

where the last equality holds because \(\textbf{K}_{s(\textbf{m}^*)}\) is a stable matrix. Hence, in \(\mathcal {N}_2\), \(G(\textbf{m})\) is an affine function of \(\textbf{m}\).
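The geometric-series identity behind this affine form is easy to check numerically. A minimal sketch, with an arbitrary illustrative stable matrix \(\textbf{K}\) and illustrative points \(\textbf{m}, \textbf{m}^*\), keeping the paper's row-vector convention \((\textbf{m}-\textbf{m}^*)\cdot \textbf{K}\):

```python
import numpy as np

# Illustrative 2x2 stable matrix K (spectral radius 0.6 < 1) and points.
K = np.array([[0.5, 0.2],
              [0.1, 0.4]])
m_star = np.array([0.6, 0.4])
m = np.array([0.7, 0.3])
assert max(abs(np.linalg.eigvals(K))) < 1  # K is a stable matrix

# Truncated series  sum_{t >= 0} (m - m*) K^t ...
diff = m - m_star
series = np.zeros_like(diff)
term = diff.copy()
for _ in range(200):        # 0.6^200 is negligible
    series += term
    term = term @ K

# ... versus the closed form (m - m*)(I - K)^{-1}
closed = diff @ np.linalg.inv(np.eye(2) - K)
```

Since the spectral radius of \(\textbf{K}\) is below one, the truncated series converges geometrically to the closed form, matching the affine expression for \(G\) on \(\mathcal {N}_2\).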

From the definition of function G, we see that for all \(\textbf{m}\in \Delta ^d\):

$$\begin{aligned} G(\textbf{m}) - G(\phi (\textbf{m}))&= \sum _{t=0}^{\infty } \big ( \Phi _t (\textbf{m}) - \textbf{m}^*\big )- \sum _{t=0}^{\infty } \big ( \Phi _t (\phi (\textbf{m})) - \textbf{m}^*\big ) \\&= \sum _{t=0}^{\infty } \big ( \Phi _t (\textbf{m}) - \textbf{m}^*\big )- \sum _{t=1}^{\infty } \big ( \Phi _t (\textbf{m}) - \textbf{m}^*\big ) \\&= \textbf{m}- \textbf{m}^*. \end{aligned}$$

Hence,

$$\begin{aligned}&\mathbb {E}[\textbf{M}^{(N)} (0)] - \textbf{m}^* \nonumber \\&= \mathbb {E}\big [ G(\textbf{M}^{(N)} (0)) - G(\phi (\textbf{M}^{(N)} (0))) \big ] \ \ \ \ \text {(By the above equality)}\nonumber \\&= \mathbb {E}\big [ G(\textbf{M}^{(N)} (1)) - G(\phi (\textbf{M}^{(N)} (0))) \big ] \ \ \ \ \text {(Since}\, \textbf{M}^{(N)}(0)\,\text { is stationary)}\nonumber \\&= \mathbb {E}\bigg [ \mathbb {E}\big [ G(\textbf{M}^{(N)} (1)) - G(\phi (\textbf{m})) \mid \textbf{M}^{(N)}(0) = \textbf{m}\big ]\cdot \mathbbm {1}_{\{ \textbf{m}\notin \mathcal {N}\} } \end{aligned}$$
(27)
$$\begin{aligned}&+ \mathbb {E}\big [ G(\textbf{M}^{(N)} (1)) - G(\phi (\textbf{m})) \mid \textbf{M}^{(N)}(0) = \textbf{m}\big ]\cdot \mathbbm {1}_{ \{\textbf{m}\in \mathcal {N}\} } \bigg ]. \end{aligned}$$
(28)

In the following, we bound (27) and (28) separately.

1.4.3 C.4.3 Bound on (27)

As G is bounded by \(\overline{G}\), we have

$$\begin{aligned}&\bigg | \bigg | \mathbb {E}\bigg [ \mathbb {E}\big [ G(\textbf{M}^{(N)} (1)) - G(\phi (\textbf{m})) \big | \textbf{M}^{(N)}(0) = \textbf{m}\big ]\cdot \mathbbm {1}_{\{ \textbf{m}\notin \mathcal {N}\} } \bigg ] \bigg | \bigg | \\&\le \ 2\overline{G} \cdot \mathbb {P}\left[ \textbf{M}^{(N)} (0) \notin \mathcal {N}\right] . \end{aligned}$$

We are left to bound \(\mathbb {P}\left[ \textbf{M}^{(N)} (0) \notin \mathcal {N}\right] \). Let \(u:= \big ( \frac{r}{2(1 + K + K^2 +\cdots + K^{\tilde{T}})} \big )^2\), where K is the Lipschitz constant of \(\phi \). We have by Lemma 11:

$$\begin{aligned}&\mathbb {P}\left[ \Vert \textbf{M}^{(N)}(\tilde{T}+1) - \Phi _{\tilde{T}+1} (\textbf{m}) \Vert \ge \frac{r}{2} \ \Big | \ \textbf{M}^{(N)}(0) = \textbf{m}\right] \\&\quad = \ \mathbb {P}\left[ \Vert \textbf{M}^{(N)}(\tilde{T}+1) - \Phi _{\tilde{T}+1} (\textbf{m}) \Vert \ge (1 + K + K^2 + \cdots + K^{\tilde{T}})\sqrt{u} \ \Big | \ \textbf{M}^{(N)}(0) = \textbf{m}\right] \\&\quad \le \ (\tilde{T}+1) \cdot e^{-2uN}. \end{aligned}$$

This shows that

$$\begin{aligned} \mathbb {P}\left[ \textbf{M}^{(N)} (0) \notin \mathcal {N}\right]&= \mathbb {P}\left[ \Vert \textbf{M}^{(N)}(0) - \textbf{m}^* \Vert \ge r \right] \nonumber \\&= \mathbb {P}\left[ \Vert \textbf{M}^{(N)}(\tilde{T}+1) - \textbf{m}^* \Vert \ge r \right] \ \ \ \ \text{(By } \text{ stationarity) } \nonumber \\&\le \mathbb {P}\left[ \Vert \textbf{M}^{(N)} (\tilde{T}+1) - \Phi _{\tilde{T}+1} (\textbf{M}^{(N)}(0)) \Vert \ge \frac{r}{2}\right] \nonumber \\&\qquad +\mathbb {P}\left[ \Vert \Phi _{\tilde{T}+1}(\textbf{M}^{(N)}(0)) - \textbf{m}^* \Vert \ge \frac{r}{2} \right] \nonumber \\&= \mathbb {P}\left[ \Vert \textbf{M}^{(N)} (\tilde{T}+1) - \Phi _{\tilde{T}+1} (\textbf{M}^{(N)}(0)) \Vert \ge \frac{r}{2} \right] \nonumber \\&\le (\tilde{T}+1) \cdot e^{-2uN}, \end{aligned}$$
(29)

where the last equality comes from our choice of \(\tilde{T} = T(r/2)\).

1.4.4 C.4.4 Bound on (28)

By Lemma 10, we have

$$\begin{aligned}&\mathbb {E}\big [ G(\textbf{M}^{(N)} (1)) - G(\phi (\textbf{m})) \ \big | \ \textbf{M}^{(N)}(0) = \textbf{m}\big ]\cdot \mathbbm {1}_{ \{\textbf{m}\in \mathcal {N}\} } \\&\quad = \mathbb {E}\big [ G(\phi (\textbf{m}) + {\varvec{\epsilon }}^{(N)} (1)) - G(\phi (\textbf{m})) \ \big | \ \textbf{M}^{(N)}(0) = \textbf{m}\big ]\cdot \mathbbm {1}_{ \{\textbf{m}\in \mathcal {N}\} } \\&\quad = \mathbb {E}\bigg [ \big ( G(\phi (\textbf{m}) + {\varvec{\epsilon }}^{(N)} (1)) - G(\phi (\textbf{m})) \big ) \cdot \mathbbm {1}_{\{ \Vert {\varvec{\epsilon }}^{(N)} (1) \Vert < \delta \}} \ \\&\qquad +\big ( G(\phi (\textbf{m}) + {\varvec{\epsilon }}^{(N)} (1)) - G(\phi (\textbf{m})) \big ) \cdot \mathbbm {1}_{\{ \Vert {\varvec{\epsilon }}^{(N)} (1) \Vert \ge \delta \}} \ \bigg | \ \textbf{M}^{(N)}(0) = \textbf{m}\bigg ]\cdot \mathbbm {1}_{ \{\textbf{m}\in \mathcal {N}\} } \end{aligned}$$

By our choice of \(\mathcal {N}\) and \(\delta \), on the event \(\{ \Vert {\varvec{\epsilon }}^{(N)} (1) \Vert < \delta \} \) appearing in the first part of the above expectation, \(\phi (\textbf{m}) + {\varvec{\epsilon }}^{(N)} (1)\) remains in \(\mathcal {N}_2\); hence \(G\big ( \phi (\textbf{m}) + {\varvec{\epsilon }}^{(N)} (1) \big )\) takes the same affine form as \(G(\phi (\textbf{m}))\). Consequently,

$$\begin{aligned}&\mathbb {E}\bigg [ \big ( G(\phi (\textbf{m}) + {\varvec{\epsilon }}^{(N)} (1)) - G(\phi (\textbf{m})) \big ) \cdot \mathbbm {1}_{\{ \Vert {\varvec{\epsilon }}^{(N)} (1) \Vert< \delta \}} \ \bigg | \ \textbf{M}^{(N)}(0) = \textbf{m}\bigg ]\cdot \mathbbm {1}_{ \{\textbf{m}\in \mathcal {N}\} } \\&\quad = \bigg [ G \big ( \mathbb {E}\big [\phi (\textbf{m}) + {\varvec{\epsilon }}^{(N)} (1) \ \big | \ \textbf{M}^{(N)} (0) = \textbf{m}\big ] \big ) - G \big ( \mathbb {E}\big [\phi (\textbf{m}) \ \big | \ \textbf{M}^{(N)} (0) = \textbf{m}\big ] \big ) \bigg ] \\&\mathbb {P} \big ( \{ \Vert {\varvec{\epsilon }}^{(N)}(1) \Vert < \delta \} \big ) \cdot \mathbbm {1}_{\{ \textbf{m}\in \mathcal {N}\}} \\&\qquad \big ( \textrm{Thanks} \, \textrm{to} \, \textrm{the} \, \textrm{affinity} \, \textrm{of} \, G \text{ in } \text{ this } \text{ case }, \ \textrm{we} \, \textrm{can} \, \textrm{interchange }\, \mathbb {E}\, \textrm{and} \, G \big ) \\&\quad = 0 \qquad \big ( \text {By Lemma 10} \big ). \end{aligned}$$

For the second part of the above expectation,

$$\begin{aligned}&\bigg | \bigg | \mathbb {E}\bigg [ \big ( G(\phi (\textbf{m}) + {\varvec{\epsilon }}^{(N)} (1)) - G(\phi (\textbf{m})) \big ) \cdot \mathbbm {1}_{\{ \Vert {\varvec{\epsilon }}^{(N)} (1) \Vert \ge \delta \}} \bigg | \textbf{M}^{(N)}(0) = \textbf{m}\bigg ] \bigg | \bigg | \cdot \mathbbm {1}_{ \{\textbf{m}\in \mathcal {N}\} } \\&\quad \le 2\overline{G} \cdot \mathbb {P} \big ( \Vert {\varvec{\epsilon }}^{(N)} (1) \Vert \ge \delta \big ) \\&\quad \le 2\overline{G} \cdot e^{-2N\delta ^2} \ \ \ \ \big ( \textrm{By Lemma 10} \big ). \end{aligned}$$

So finally

$$\begin{aligned} \big | \big | \mathbb {E}\big [ G(\textbf{M}^{(N)} (1)) - G(\phi (\textbf{m})) \big | \textbf{M}^{(N)}(0) = \textbf{m}\big ] \big | \big | \cdot \mathbbm {1}_{ \{\textbf{m}\in \mathcal {N}\} } \le 0 + 2\overline{G} \cdot e^{-2N \delta ^2} = 2\overline{G} \cdot e^{-2N \delta ^2}. \end{aligned}$$

1.4.5 C.4.5 Conclusion of the proof

To summarize, we have obtained by (29):

$$\begin{aligned} \mathbb {P}\left[ \textbf{M}^{(N)}(0)\not \in \mathcal {Z}_{s(\textbf{m}^*)}\right]&\le \mathbb {P}\left[ \textbf{M}^{(N)} (0) \notin \mathcal {N}\right] \\&\le (\tilde{T}+1) \cdot e^{-2uN} \\&\le b \cdot e^{-cN}, \end{aligned}$$

and

$$\begin{aligned} \Vert \mathbb {E} \big [ \textbf{M}^{(N)}(0) \big ] - \textbf{m}^* \Vert&\le \ 2\overline{G} \cdot e^{-2N\delta ^2} + 2\overline{G}(\tilde{T}+1)\cdot e^{-2Nu} \\&\le \ b \cdot e^{-cN}, \end{aligned}$$

where b, c can be taken as \(b:= (2\overline{G}+1)(\tilde{T}+2)\), \(c:= \min (\delta ^2, u)\), and this concludes the proof of Theorem 8. \(\square \)

D Proof of Theorem 5

Recall that \(\textbf{M}^{(N)}(t)\) is the configuration of the system at time t, meaning that \(M^{(N)}_i(t)\) is the fraction of arms in state i at time t. Let \(\textbf{e}_i\) be the d-dimensional vector whose components are all equal to 0 except the ith one, which equals 1. The process \(\textbf{M}^{(N)}\) is a continuous-time Markov chain that jumps from a configuration \(\textbf{m}\) to a configuration \(\textbf{m}+\frac{1}{N}(\textbf{e}_j-\textbf{e}_i)\) when an arm jumps from state i to state j. For \(i<s(\textbf{m})\), this occurs at rate \(Nm_iQ^1_{ij}\), as all of these arms are activated. For \(i>s(\textbf{m})\), this occurs at rate \(Nm_iQ^0_{ij}\), as these arms are not activated. For \(i=s(\textbf{m})\), this occurs at rate \(N\big ((\alpha -\sum _{k=1}^{s(\textbf{m})-1} m_k)Q^1_{ij} + (\sum _{k=1}^{s(\textbf{m})} m_k-\alpha )Q^0_{ij}\big )\). Let us define:

$$\begin{aligned} \lambda _{ij}(\textbf{m}) = \left\{ \begin{array}{ll} m_iQ^1_{ij} &{} \text { if}\, i<s(\textbf{m})\\ (\alpha -\sum _{k=1}^{s(\textbf{m})-1} m_k)Q^1_{ij} + (\sum _{k=1}^{s(\textbf{m})} m_k-\alpha )Q^0_{ij} &{} \text { if}\, i=s(\textbf{m})\\ m_iQ^0_{ij} &{} \text { if}\, i>s(\textbf{m}). \end{array} \right. \end{aligned}$$

The process \(\textbf{M}^{(N)}\) jumps from \(\textbf{m}\) to \(\textbf{m}+(\textbf{e}_j-\textbf{e}_i)/N\) at rate \(N \lambda _{ij}(\textbf{m})\). This shows that \(\textbf{M}^{(N)}\) is a density-dependent population process as defined in Kurtz [20]. It is shown in Kurtz [20] that, for any finite time t, the trajectories of \(\textbf{M}^{(N)}(t)\) converge to the solution of a differential equation \(\dot{\textbf{m}}=f(\textbf{m})\) as N grows, with \(f(\textbf{m}):= \sum _{i\ne j}\lambda _{ij}(\textbf{m})(\textbf{e}_j-\textbf{e}_i)\). The function \(f(\textbf{m})\) is called the drift of the system. It should be clear that \(f(\textbf{m})=\tau (\phi (\textbf{m})-\textbf{m})\), where \(\phi \) is defined for the discrete-time version of our continuous-time bandit problem.
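The rates \(\lambda _{ij}(\textbf{m})\) and the resulting drift \(f(\textbf{m})\) can be sketched as follows. The generator matrices \(Q^0, Q^1\) below are hypothetical illustrative values (each row sums to zero), as are \(\alpha \) and \(\textbf{m}\):

```python
import numpy as np

# Hypothetical 3-state example; Q0, Q1 are illustrative generator matrices.
Q1 = np.array([[-1.0,  0.6,  0.4],
               [ 0.5, -0.9,  0.4],
               [ 0.3,  0.5, -0.8]])
Q0 = np.array([[-0.4,  0.3,  0.1],
               [ 0.2, -0.5,  0.3],
               [ 0.1,  0.2, -0.3]])
alpha = 0.5
m = np.array([0.3, 0.4, 0.3])

def drift(m, alpha, Q0, Q1):
    """f(m) = sum_{i != j} lambda_ij(m) (e_j - e_i), via activated/passive masses."""
    cum = np.cumsum(m)
    s = int(np.searchsorted(cum, alpha, side="right"))  # 0-based index of s(m)
    w1 = np.zeros_like(m)                  # mass of activated arms per state
    w1[:s] = m[:s]
    w1[s] = alpha - (cum[s - 1] if s > 0 else 0.0)
    w0 = m - w1                            # mass of passive arms per state
    # Because each row of Q0 and Q1 sums to zero, this row-vector product
    # equals sum_{i != j} lambda_ij(m) (e_j - e_i).
    return w1 @ Q1 + w0 @ Q0

f = drift(m, alpha, Q0, Q1)
```

Since the rows of \(Q^0\) and \(Q^1\) sum to zero, the components of \(f(\textbf{m})\) sum to zero: the drift conserves total mass, so the deterministic dynamics stay in the simplex \(\Delta ^d\).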

For \(t \ge 0\), denote by \(\Phi _t \textbf{m}\) the value at time t of the solution of the differential equation that starts in \(\textbf{m}\) at time 0; it satisfies

$$\begin{aligned} \Phi _t \textbf{m}= \textbf{m}+ \int _{0}^{t} f(\Phi _s \textbf{m}) ds. \end{aligned}$$

Following Gast and Van Houdt [10] and Ying [34], we denote by \(L^{(N)}\) the generator of the N-arm system and by \(\Lambda \) the generator of the differential equation. These associate to each almost-everywhere differentiable function h two functions \(L^{(N)}h\) and \(\Lambda h\), defined as

$$\begin{aligned} \big ( L^{(N)}h \big )(\textbf{m})&:= \sum _{i=1}^{d}\sum _{j \ne i} N \lambda _{ij}(\textbf{m})\cdot \big ( h(\textbf{m}+\frac{\textbf{e}_j-\textbf{e}_i}{N}) - h(\textbf{m}) \big ),\\ \big ( \Lambda h \big )(\textbf{m})&:= f(\textbf{m}) \cdot Dh(\textbf{m}), \end{aligned}$$

with Dh being the differential of the function h. The function \(\Lambda h \) is defined only at points \(\textbf{m}\) where h is differentiable. Remark that if h is an affine function of \(\textbf{m}\), i.e., \(h(\textbf{m}) = \textbf{m}\cdot \textbf{B}+ \textbf{b}\), with \(\textbf{B}\) a \(d\times d\) matrix and \(\textbf{b}\) a d-dimensional vector, then \(\big (L^{(N)}h\big )(\textbf{m}) = \big (\Lambda h \big ) (\textbf{m}) = f(\textbf{m}) \cdot \textbf{B}\).
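The remark that \(L^{(N)}h\) and \(\Lambda h\) coincide on affine h can be verified numerically; a minimal sketch, where the rates \(\lambda _{ij}\) (frozen at a single configuration \(\textbf{m}\)), the matrix \(\textbf{B}\), and the vector \(\textbf{b}\) are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 3, 50
# Hypothetical jump rates lambda_ij, frozen at one configuration m.
lam = rng.random((d, d))
np.fill_diagonal(lam, 0.0)
B = rng.random((d, d))             # affine h(m) = m . B + b
b = rng.random(d)
m = np.array([0.2, 0.5, 0.3])

h = lambda x: x @ B + b
E = np.eye(d)

# (L^N h)(m) = sum_{i != j} N lambda_ij ( h(m + (e_j - e_i)/N) - h(m) )
LNh = sum(N * lam[i, j] * (h(m + (E[j] - E[i]) / N) - h(m))
          for i in range(d) for j in range(d) if i != j)

# (Lambda h)(m) = f(m) . Dh(m) = f(m) . B, with f(m) = sum lambda_ij (e_j - e_i)
f = sum(lam[i, j] * (E[j] - E[i])
        for i in range(d) for j in range(d) if i != j)
Lh = f @ B
```

For affine h the finite differences \(h(\textbf{m}+\frac{\textbf{e}_j-\textbf{e}_i}{N}) - h(\textbf{m})\) are exact, which is why the two generators agree without any error term in N.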

Now the analogue of Theorem 8(i) in the continuous-time case is

Theorem 13

Under the same assumptions as in Theorem 5, and assuming that \(\textbf{M}^{(N)} (0)\) is in the stationary regime, there exist two constants \(b,c >0\) such that

$$\begin{aligned} \Vert \mathbb {E}[\textbf{M}^{(N)} (0)] - \textbf{m}^{*} \Vert \le b \cdot e^{-cN}. \end{aligned}$$

Note first that, as in the discrete-time case, Theorem 13 implies Theorem 5.

Proof

Define the continuous-time version of function G as

$$\begin{aligned} G(\textbf{m}):= \int _{0}^{\infty } \big ( \Phi _t \textbf{m}- \textbf{m}^{*} \big ) \hbox {d}t. \end{aligned}$$

As for the discrete-time case, our assumptions imply that the unique fixed point is an exponentially stable attractor and a result similar to Lemma 12 can be obtained for the continuous-time case. This implies that the function G is well-defined, continuous and bounded.

Recall that the function f is affine in \(\mathcal {Z}_{s(\textbf{m}^*)}\): indeed, if \(\textbf{m}\in \mathcal {Z}_{s(\textbf{m}^*)}\), then \(\phi (\textbf{m})=(\textbf{m}-\textbf{m}^*)\textbf{K}+\textbf{m}^*\), where \(\textbf{K}\) is a \(d\times d\) matrix, and \(f(\textbf{m})= \tau (\phi (\textbf{m})-\textbf{m}) = \tau (\textbf{m}-\textbf{m}^*)(\textbf{K}-\textbf{I})\). Now suppose that \(\textbf{m}\in \Delta ^d\) is such that \(\Phi _t \textbf{m}\) remains inside \(\mathcal {Z}_{s(\textbf{m}^*)}\) for all \(t\ge 0\); then

$$\begin{aligned} \Phi _t \textbf{m}= (\textbf{m}-\textbf{m}^*)\cdot e^{t \cdot \tau (\textbf{K}-\textbf{I})} + \textbf{m}^*, \text{ and } \ G(\textbf{m}) = \frac{1}{\tau } (\textbf{m}-\textbf{m}^*)(\textbf{I}-\textbf{K})^{-1}. \end{aligned}$$

So, as in the discrete-time case, \(G(\textbf{m})\) is an affine function of \(\textbf{m}\), with affine factor \(\textbf{B}:= \frac{1}{\tau }(\textbf{I}-\textbf{K})^{-1}\).
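This closed form can be cross-checked numerically: for a stable matrix \(A = \tau (\textbf{K}-\textbf{I})\), we have \(\int _0^{\infty } e^{tA}\,\hbox {d}t = -A^{-1}\), so the integral of \(\Phi _t\textbf{m}-\textbf{m}^*\) equals \(\frac{1}{\tau }(\textbf{m}-\textbf{m}^*)(\textbf{I}-\textbf{K})^{-1}\). A minimal sketch with an illustrative stable \(\textbf{K}\), \(\tau =1\), and a crude forward-Euler integrator (not a rigorous scheme):

```python
import numpy as np

tau = 1.0
K = np.array([[0.5, 0.2],
              [0.1, 0.4]])      # illustrative stable matrix (spectral radius 0.6)
m_star = np.array([0.6, 0.4])
m = np.array([0.8, 0.2])
A = tau * (K - np.eye(2))       # inside the zone, x(t) = Phi_t m - m* solves x' = x A

# Forward-Euler integration of the ODE, accumulating G(m) = int_0^inf x(t) dt.
dt, n_steps = 1e-3, 40_000      # horizon T = 40, long enough for x(t) ~ 0
x = m - m_star
G = np.zeros(2)
for _ in range(n_steps):
    G += x * dt                 # left Riemann sum for the integral
    x = x + dt * (x @ A)        # Euler step

closed = (m - m_star) @ np.linalg.inv(np.eye(2) - K) / tau
```

The numerical integral agrees with the affine closed form up to the Euler discretization error, which is of order dt here.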

As \(\textbf{m}^*\) is non-singular, it is at a positive distance from the other zones \(\mathcal {Z}_i \ne \mathcal {Z}_{s(\textbf{m}^*)}\), and we therefore define \(\delta := \min _{i\ne s(\textbf{m}^*)} d(\textbf{m}^*,\mathcal {Z}_i)/2>0\), where \(d(\cdot ,\cdot )\) is the distance induced by the \(\Vert \cdot \Vert \)-norm. We then choose a neighborhood \(\mathcal {N}_1:= \mathcal {B}(\textbf{m}^*, \epsilon _1) \cap \Delta ^d\) of \(\textbf{m}^*\) such that for all \(t \ge 0 \) and every initial condition \(\textbf{m}\in \mathcal {N}_1\), \(\Phi _t(\textbf{m}) \in \mathcal {B}(\textbf{m}^{*}, \delta )\). This is possible because \(\textbf{m}^*\) is an exponentially stable attractor. Following Theorem 3.2 of Gast [9], we have

$$\begin{aligned}&\textbf{m}^* - \mathbb {E} \big [ \textbf{M}^{(N)}(0) \big ]\nonumber \\&\quad = \mathbb {E} \big [ \Lambda G \big ( \textbf{M}^{(N)} (0) \big ) \big ] \nonumber \\&\quad = \mathbb {E} \big [ (\Lambda - L^{(N)}) G \big ( \textbf{M}^{(N)} (0) \big ) \big ] \nonumber \\&\quad = \mathbb {E} \Big [ \Big ( (\Lambda - L^{(N)}) G \big ( \textbf{M}^{(N)} (0) \big ) \Big ) \cdot \mathbbm {1}_{\{ \textbf{M}^{(N)} (0) \in \mathcal {N}\} } \end{aligned}$$
(30)
$$\begin{aligned}&+ \Big ( (\Lambda - L^{(N)}) G \big ( \textbf{M}^{(N)} (0) \big ) \Big ) \cdot \mathbbm {1}_{ \{ \textbf{M}^{(N)} (0) \notin \mathcal {N} \} } \Big ], \end{aligned}$$
(31)

where \(\mathcal {N}:= \mathcal {B}(\textbf{m}^*, \epsilon _1/2) \cap \Delta ^d\). Let \(N_0:= \lceil 2/\epsilon _1 \rceil \). For \(N\ge N_0\), every \(\textbf{m}\in \mathcal {N}\) additionally satisfies \(\Phi _t \big ( \textbf{m}+ \frac{\textbf{e}_j-\textbf{e}_i}{N}\big ) \in \mathcal {Z}_{s(\textbf{m}^*)}\) for all \(1 \le i \ne j \le d\) and \(t\ge 0\). Hence, G is locally affine, and for all \(\textbf{m}\in \mathcal {N}\) and \(N \ge N_0\), we have:

$$\begin{aligned} \big ( \Lambda G \big ) (\textbf{m}) = \big ( L^{(N)} G \big ) (\textbf{m}) = f(\textbf{m}) \cdot \textbf{B}. \end{aligned}$$
(32)

This shows that the term (30) is equal to zero.

For the second term, note that both G and \(\Lambda G\) are continuous functions on the compact set \(\Delta ^d\), hence bounded, while \(L^{(N)}G\) grows at most linearly with N. We can therefore choose constants \(u,v > 0\) independent of N such that:

$$\begin{aligned} \sup _{\textbf{m}\in \Delta ^d} \Vert \big ( \Lambda G \big ) (\textbf{m}) \Vert \le u, \ \sup _{\textbf{m}\in \Delta ^d}\Vert \big (L^{(N)} G \big ) (\textbf{m}) \Vert \le v N. \end{aligned}$$

We are left to bound \(\mathbb {P} \big ( \textbf{M}^{(N)} (0) \notin \mathcal {N} \big )\) exponentially from above. This could be done using the (unnamed) proposition on page 644 of Weber and Weiss [31]; however, we were not able to locate the reference given there for the proof of that proposition. We therefore provide a direct proof below, relying on an exponential martingale concentration inequality borrowed from Darling and Norris [6], which in our setting can be stated as follows.

Lemma 14

Fix \(T > 0\). Let K be the Lipschitz constant of the drift f, denote \(\lambda := \max _{i,j} \lambda _{ij}\), and set \(c_1:= e^{-2KT} / 18T\). If \(\epsilon > 0\) is such that

$$\begin{aligned} 1 \ge \epsilon \lambda \cdot \text{ exp } \left( \frac{\epsilon ^2 e^{-KT}}{3T} \right) , \end{aligned}$$
(33)

then we have

$$\begin{aligned} \mathbb {P} \Big [ \sup _{t \le T} \Vert \textbf{M}^{(N)}(t) - \Phi _t \textbf{m}\Vert > \epsilon \Big | \ \textbf{M}^{(N)}(0) = \textbf{m}\Big ] \le 2d \cdot e^{-c_1 N\epsilon ^3}. \end{aligned}$$
(34)

The above lemma plays the role of Lemma 11 in the discrete-time case. Note that its original form, stated as Theorem 4.2 in Darling and Norris [6], is given in a more general framework, which considers a continuous-time Markov chain with countable state space evolving in \(\mathbb {R}^d\) and discusses a differential equation approximation to the trajectories of such a Markov chain. As such, the right-hand side of (34) has an additional term \(\mathbb {P} (\Omega _0^c \cup \Omega _1^c \cup \Omega _2^c)\), with \(\Omega _i^c\) being the complement of \(\Omega _i\). In our case, \(\Omega _0 = \Omega _1 = \Omega \) holds trivially, while the analysis of \(\Omega _2\) is more involved. However, as remarked before the statement of Theorem 4.2 in Darling and Norris [6], if the maximum jump rate (in our case \(N \lambda \)) and the maximum jump size (in our case 1/N) of the Markov chain satisfy a certain inequality, which in our situation can be stated as (33), then \(\Omega _2 = \Omega \). Note that the constraint (33) is satisfied as long as \(\epsilon \) is sufficiently small, and consequently \(\mathbb {P} (\Omega _0^c \cup \Omega _1^c \cup \Omega _2^c) = 0\).

Now let \(\epsilon >0\) be such that \(\mathcal {B}(\textbf{m}^*, 2\epsilon ) \cap \Delta ^d \subset \mathcal {N}\). The uniform global attractor assumption on \(\textbf{m}^{*}\) ensures that there exists \(T>0\) such that for all \(\textbf{m}\in \Delta ^d\) and \(t\ge T\): \(\Phi _t \textbf{m}\in \mathcal {B}(\textbf{m}^*,\epsilon )\). Let such T and \(\epsilon \) be as in Lemma 14 that verify additionally (33). This is possible as the right-hand side of (33) converges to 0 when \(\epsilon \) is small and T is large.

We then have:

$$\begin{aligned}&\mathbb {P} \big [ \textbf{M}^{(N)}(0) \notin \mathcal {N} \big ]\\&= \mathbb {P} \big [ \textbf{M}^{(N)}(T) \notin \mathcal {N} \big ] \qquad (\text{ By } \text{ stationarity}) \\&\le \mathbb {P} \big [ \Vert \textbf{M}^{(N)}(T)-\textbf{m}^* \Vert \ge 2 \epsilon \big ] \\&\le \mathbb {P} \big [ \Vert \textbf{M}^{(N)}(T) - \Phi _T (\textbf{M}^{(N)}(0)) \Vert> \epsilon \big ] + \mathbb {P} \big [ \Vert \Phi _T (\textbf{M}^{(N)}(0)) - \textbf{m}^{*} \Vert> \epsilon \big ] \\&= \mathbb {P} \big [ \Vert \textbf{M}^{(N)}(T) - \Phi _T (\textbf{M}^{(N)}(0)) \Vert > \epsilon \big ] \qquad \big ( \text{ By } \text{ our } \text{ choice } \text{ of }\, T\big ) \\&\le 2d \cdot e^{-c_1 N\epsilon ^3} \qquad \big ( \text{ We } \text{ apply } \text{(34) } \text{ of } \text{ Lemma } \text{14 } \big ). \end{aligned}$$

So in summary, (30)-(31) gives

$$\begin{aligned} \Vert \mathbb {E} \big [ \textbf{M}^{(N)}(0) \big ] - \textbf{m}^{*} \Vert&\le (u+vN)\cdot 2d \cdot e^{-c_1 N\epsilon ^3}. \end{aligned}$$
(35)

Moreover, for any \(c'>0\) and \(0<c<c'\), \(N \cdot e^{-c'N}=\mathcal {O}(e^{-cN})\), so the right-hand side of (35) can be bounded by a term of the form \(b \cdot e^{-cN}\). This concludes the proof of Theorem 13. \(\square \)


Cite this article

Gast, N., Gaujal, B. & Yan, C. Exponential asymptotic optimality of Whittle index policy. Queueing Syst 104, 107–150 (2023). https://doi.org/10.1007/s11134-023-09875-x
