1 Introduction

The classical stochastic multi-armed bandit problem is specified by a collection of probability distributions \(\{P_{k}\}_{k =1}^K\), commonly referred to as arms. A single agent plays an arm \(I_t\) taking values in \([K]:=\{ 1,\ldots ,K \}\) at each time step \(t\in [T]\) and receives an associated reward \(X_t \sim P_{I_t}\). The agent’s goal is to minimise the expected regret \({{\,\mathrm{\mathbb {E}}\,}}[{R}_{T}]= T \mu _\star - \sum _{t=1}^{T}\mathbb {E}[X_{t}]\), where \(\mu _k\) is the expectation of a random variable with distribution \(P_k\), \(\star :={{\,\mathrm{argmax}\,}}_{k \in [K]} \mu _k\) is the arm with the largest mean and \(\mu _\star \) is that largest mean. The agent’s decisions must be made using only the knowledge acquired from previous actions and observed rewards.
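To make the protocol concrete, the following minimal sketch simulates this interaction loop and measures realised regret; the arm means and the uniform-random policy are illustrative assumptions, not the method studied in this paper.

```python
import numpy as np

# A minimal sketch of the bandit interaction protocol (illustrative arm means
# and a placeholder uniform-random policy).
rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.9])           # assumed Bernoulli arm means, K = 3
T = 10_000

rewards = 0.0
for t in range(T):
    arm = rng.integers(len(mu))          # I_t: here chosen uniformly at random
    rewards += rng.binomial(1, mu[arm])  # X_t ~ P_{I_t}
print("realised regret:", T * mu.max() - rewards)
```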

We consider an extension of this problem where there are multiple agents collaborating on a multi-armed bandit problem [3, 17]. The agents may communicate with one another, and each agent’s decisions of which arms to play can be made using information from both its own reward history and the sequence of messages received from other agents. However, communication between agents is tightly restricted, as described in Sect. 2. Specifically, time is divided into growing phases and each agent may receive only one message per phase. Furthermore, a message is limited to recommending the id of a single arm; no additional information may be exchanged. We show in Theorem 3.1 that, even with these restrictions on communication, it is possible to asymptotically match the optimal total regret achievable with unlimited communication.

The multi-agent version of the multi-armed bandit problem is motivated by multiple applications:

  • Decentralised web advertising Consider the problem of selecting an advertisement to be displayed on a website. The website will want to optimise which advertisements it displays with the goal of maximising click-through rate. To do this, the website should react to its users’ interactions. Additionally, the website could be hosted on multiple web servers operating in parallel to serve many users at once. Here, each web server will select an advertisement to display each time a user requests the page. The web servers can benefit from sharing information on the performance of each advertisement, but their communication will be limited by bandwidth and potentially geographical constraints. This motivates the communication constraints imposed in our setting. A bandit-based approach to ad selection is considered in [16].

  • Decentralised network routing In this problem, a user wants to send data over a network between two computers as fast as possible. In the network, there are many paths the data can be sent along. These paths will have different latencies, and the user can measure the latency of a path after using it. A bandit algorithm can use this information to choose the best path to send the data along. In addition, it is likely that multiple people are using the same network at once. These users can collaborate to find the best paths faster. In [4, 8], bandit algorithms are applied to network routing problems.

  • Multi-robot systems Multi-agent multi-armed bandit algorithms can be used to operate multi-robot systems. In particular, [18] considers the problem of foraging using a group of robots. In this problem, the robots need to search for the sites from which they can forage the most. The robots can communicate with other nearby robots over a wireless network, which can be used to quickly identify the best sites to forage from. Since the communication is constrained locally, this is similar to the setting we are considering.

It should be noted that in these examples contextual information could be used to improve the decision-making, and the expected rewards may be non-stationary. In the network routing example, there could also be a penalty if more than one user chooses a single path. However, we work in a simplified setting without these complications, in which it is currently more feasible to prove results.

There has recently been growing interest in multi-agent multi-armed bandits. A setting in which agents communicate with a central node is considered in [9], while [2, 5, 15, 19] consider settings where agents can communicate rewards (not just arm ids) with their neighbours. Kola et al. [10] considered a model where agents observe the rewards of their neighbours. We follow the setting introduced in [3, 17], where agents may only communicate arm ids, and do so through a gossiping PULL protocol. This ensures that in each round the number of bits communicated is bounded and relatively small. Additionally, we prefer a decentralised system over a centralised one as it does not have a single point of failure. Furthermore, a centralised system would have a high communication overhead through its central node, which may be limiting in applications. In a recent work [1], the authors introduced a method for achieving nearly minimax optimal regret in the gossiping and decentralised setting.

A central problem in the multi-armed bandit literature is the search for algorithms which perform optimally in the asymptotic regime of the time horizon T tending to infinity. Returning to the single-agent setting, Lai and Robbins [11] proved a fundamental lower bound on the regret incurred by any consistent algorithm. Here, we say that an algorithm is consistent if it achieves subpolynomial regret for all possible values of \(\{P_{k}\}_{k =1}^K\). (This precludes trivial algorithms like one which always selects a specific arm and has zero regret if that happens to be the best arm.) Lai and Robbins [11] showed that the regret of any consistent algorithm satisfies the following lower bound:

$$\begin{aligned} \liminf _{T \rightarrow \infty }\frac{{{\,\mathrm{\mathbb {E}}\,}}[{\mathcal {R}}_T]}{\log (T)}&\ge \sum _{i \ne \star } \frac{\mu _{\star } - \mu _i}{{{\,\mathrm{KL}\,}}(P_i, P_{\star })}, \end{aligned}$$
(1)

where \({{\,\mathrm{KL}\,}}\) denotes the Kullback–Leibler divergence. A significant breakthrough was achieved by [6, 14] who demonstrated that this bound is attained by the KL-UCB algorithm in the Bernoulli reward setting.

In this work, we consider the question of asymptotic optimality in the decentralised multi-agent setting. Our contributions are as follows:

  • We present a decentralised algorithm which builds upon and improves the Gossip-Insert-Eliminate method of Chawla et al. [3]. This algorithm leverages two innovations which reduce the amount of superfluous exploration. Firstly, we include a more efficient elimination mechanism which reduces the number of arms considered by each agent at any given time. Secondly, in the spirit of [6, 14], we use KL-type confidence intervals, rather than Hoeffding-type confidence intervals.

  • We provide a theoretical analysis of the expected regret of the algorithm we propose (Theorem 3.1). We show that it is optimal in the asymptotic regime. In particular, the aggregate expected regret matches the lower bound implied by (1), showing that our algorithm performs at least as well as any multi-agent algorithm, even with access to unlimited communication resources, in the asymptotic regime.

  • We find a regret bound which has a clear dependence on the graph structure (Theorem 3.15). This is done in the setting where agents pull recommendations uniformly at random from their neighbours. This allows us to leverage an existing result of Giakkoupis [7] on the spreading time of a rumour on a network following a PULL protocol in discrete time. We conclude that section by comparing the impact that three different graphs (complete, star and cycle) have on the scaling of the regret bound.

  • We present empirical results that demonstrate that our algorithm performs well in a wide variety of settings, with lower finite sample regret than the baseline of [3] (Figs. 1, 2). Interestingly, both modifications lead to a consistent improvement across a range of values of the gap between the best and second-best arms.

2 Setting and Algorithm

We now present our problem setting and algorithm. Throughout, N will denote the number of agents, T the number of time steps and K the number of arms. Let \(X^n_{k,s}\) taking values in \(\{ 0,1\}\) denote the reward that agent \(n \in [N]\) receives by playing arm \(k\in [K]\) for the \(s^{th}\) time. We assume that these are i.i.d. Bernoulli(\(\mu _k\)) random variables. Let \(\star \in {{\,\mathrm{argmax}\,}}_{k \in [K]}\mu _k\) and let \(\mu _{\star }:= \max _{k \in [K]}\mu _k\). We assume throughout that there is a unique best arm, so \(\star \) is uniquely defined.

Communication between agents is constrained by a strictly increasing sequence \((A_j)_{j \in {{\,\mathrm{\mathbb {N}}\,}}}\) of communication rounds and an \(N\times N\) probability matrix P as follows. The time horizon [T] is partitioned into phases, with phase j consisting of time steps t for which \(A_{j-1} <t \le A_j\), where \(A_0:=0\). Communication between agents only occurs once per phase, on the time steps \(A_j\), after each agent has played an arm. On these time steps, each agent PULLs a message from exactly one of their neighbours chosen at random, independently of everything else. The neighbouring agent is selected randomly according to P, with \(P(n,q)\) denoting the probability that agent n will receive a message from agent q at the end of a phase. We let \(Q\equiv Q^n_j \sim P(n,\cdot )\) be the random variable corresponding to the agent who sends a recommendation to agent n at the end of phase j. The message, from agent \(Q^n_j\) to n, is an arm recommendation \(O^n_j\) taking values in [K].

To ensure that the recommendations can spread to all agents, we assume that P is strongly connected, meaning that for any two agents \(i, j \in [N]\) with \(i \ne j\) there exists a sequence of agents \(n_1, \dots , n_l \in [N]\) such that \(P(i, n_1), P(n_1, n_2), \dots , P(n_{l-1}, n_l), P(n_l, j) > 0\).
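As an illustration of the PULL protocol, the sketch below performs one end-of-phase communication round under a row-stochastic matrix P; the cycle topology and the placeholder recommendations are assumptions made purely for the example.

```python
import numpy as np

# One end-of-phase PULL round: each agent n samples a neighbour Q ~ P(n, .)
# and receives that neighbour's recommendation.
rng = np.random.default_rng(1)
N = 5
P = np.zeros((N, N))
for n in range(N):
    P[n, (n - 1) % N] = 0.5               # pull from the left neighbour ...
    P[n, (n + 1) % N] = 0.5               # ... or the right neighbour

most_played = [n % 3 for n in range(N)]   # placeholder M_j^n values
for n in range(N):
    q = rng.choice(N, p=P[n])             # Q^n_j ~ P(n, .)
    print(f"agent {n} pulls from agent {q}, receives arm {most_played[q]}")
```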

Let \(I^n_t\) denote the random variable, taking values in [K], which specifies the index of the arm played by agent n in round t. This must be a measurable function of an agent’s previous reward history and the previous messages they have received. We let \(V^n_k(t):= \sum _{s=1}^t {{\,\mathrm{\mathbbm {1}}\,}}\{I^n_s=k\}\) denote the number of times agent n plays arm k in the first t rounds. Let \(X^n(t):=X^n_{I^n_t,V^n_{I^n_t}(t)}\) denote the reward received by agent n in round t.

The goal for each agent \(n \in [N]\) is to minimise their expected regret,

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}[{\mathcal {R}}_T^n] :=T \cdot \mu _{\star } - \sum _{t \in [T]} {{\,\mathrm{\mathbb {E}}\,}}[X^n(t)]. \end{aligned}$$

Our algorithm (Algorithm 1) is based on the Gossip-Insert-Eliminate algorithm of [3]. A key feature of this algorithm is that, during each phase j, each agent plays only a small subset of the K arms which we call its active set. This is made up of a sticky set of arms, which remains unchanged over time for each agent, and additional arms which evolve over time based on recommendations.

In our algorithm, we begin by partitioning [K] into nearly equal-sized sets \(\{{S}^n_{\circ }\}_{n \in [N]}\), so that for each agent \(n \in [N]\), \(S^n_\circ \) will act as the associated sticky set. The active sets are initialised to be the same as the sticky sets, but will grow over time due to recommendations and shrink due to eliminations of non-sticky arms. In each phase \(j\in {{\,\mathrm{\mathbb {N}}\,}}\), each agent \(n \in [N]\) will only play arms from the active set \(S^n_j\). For the first phase \(j=1\), we initialise each \(S^n_1= S^n_\circ \). In each subsequent phase, the active set \(S^n_{j+1}\) consists of \(S^n_\circ \), along with (potentially) additional arms.

We assume that each agent n is aware of \({S}^n_{\circ }\), its own set of arms within the partition, a priori. That is, \({S}^n_{\circ }\) may be taken as an input to our algorithm. Let \({\hat{\mu }}^n_{k,s}:=\frac{1}{s}\sum _{i=1}^s X^n_{k,i}\). Denote by \({\hat{\mu }}^n_k(t):={\hat{\mu }}^n_{k,V^n_k(t)}\) the mean reward obtained by agent n from arm k in the first t time steps.

We let \(M_j^n\) denote the most played arm by agent n in phase j, so

$$\begin{aligned} M_j^n = {{\,\mathrm{argmax}\,}}_{k \in [K]} \{V_k^n(A_j)-V^n_k(A_{j-1})\}. \end{aligned}$$

Following [3], when an agent \(q \in [N]\) is asked for an arm recommendation at the end of phase j, its recommendation will be its most played arm for that phase. Hence, when \(Q\equiv Q^n_j \sim P(n,\cdot )\) communicates with agent \(n \in [N]\) at the end of phase j, the recommendation will be \(O^n_j=M^{Q}_j\).

Our algorithm (Algorithm 1) differs from that of [3] in two important respects.

Firstly, we use a more efficient elimination scheme. More precisely, in each phase \(j+1\), the new active set \(S^n_{j+1}\) will consist of the sticky set \(S_\circ ^n\), together with the agent’s most played arm \(M^n_j\) during phase j and the recommendation, \(O^n_j\), it receives at the end of phase j. We assume that the random variable \(Q^n_j\) is independent of everything else.

The intuition is that, eventually, the best arm will become known to all agents, and \(M^n_j\) and \(O^n_j\) will both be equal to \(\star \); consequently, \(S^n_j\) will be \(S^n_{\circ }\cup \{ \star \}.\)

Secondly, we use tighter KL-based confidence intervals, following [6]. To define our KL upper confidence bounds, we first let \({{\,\mathrm{KL}\,}}: [0,1]^2 \rightarrow \mathbb {R} \cup \{\infty \}\) be the Kullback–Leibler divergence between two Bernoulli random variables (identified by their means) and introduce the function \(f_\alpha (t) = 1 + t^{\alpha } \log ^2(t)\) indexed by \(\alpha \). The upper confidence bound for arm k at agent n at time t is defined by

$$\begin{aligned} \mathrm {U}^{n}_{k,\alpha }(t - 1) := \max \left\{ u \in [0, 1]: {{\,\mathrm{KL}\,}}({\hat{\mu }}^{n}_{k}(t - 1), u) \le \frac{\log (f_\alpha (t))}{V_k^n(t - 1)} \right\} \end{aligned}$$
(2)

when \(V_k^{n}(t - 1)>0\) and \(\mathrm {U}^{n}_{k,\alpha }(t - 1) := \infty \) otherwise. When \(\alpha \) is clear from context, we suppress it for notational convenience.
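For concreteness, the bound in (2) can be computed by bisection, since \(u \mapsto {{\,\mathrm{KL}\,}}({\hat{\mu }}, u)\) is increasing for \(u \ge {\hat{\mu }}\). The following sketch (with helper names of our own choosing) is one way this might be implemented.

```python
import math

def bern_kl(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    out = 0.0
    if p > 0:
        out += p * math.log(p / q)
    if p < 1:
        out += (1 - p) * math.log((1 - p) / (1 - q))
    return out

def kl_ucb(mu_hat: float, pulls: int, t: int, alpha: float = 1.0) -> float:
    """U^n_{k,alpha}(t-1): largest u with KL(mu_hat, u) <= log f_alpha(t) / pulls."""
    if pulls == 0:
        return float("inf")
    level = math.log(1 + t**alpha * math.log(t) ** 2) / pulls
    lo, hi = mu_hat, 1.0
    for _ in range(50):          # bisection: KL(mu_hat, u) increases for u >= mu_hat
        mid = (lo + hi) / 2
        if bern_kl(mu_hat, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo

# e.g. kl_ucb(0.4, pulls=25, t=1000) returns the upper confidence bound
```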

Algorithm 1 gives the steps that each agent \(n \in [N]\) will perform synchronously.

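Algorithm 1 is presented as a figure; as a textual stand-in, the following sketch reconstructs one agent's loop from the description above, reusing the kl_ucb helper from the previous sketch. The near-equal sticky-set split, the tie-breaking and the recommend callback (standing in for the PULL step through P) are illustrative assumptions rather than the exact pseudocode.

```python
import numpy as np

# Reconstruction sketch of Algorithm 1 for a single agent, pieced together
# from the prose above. `recommend(n, j)` is an assumed callback returning
# the recommendation O_j^n received at the end of phase j.
def run_agent(n, N, K, mu, phase_ends, recommend, rng):
    sticky = set(range(n * K // N, (n + 1) * K // N))   # S_circ^n
    active = set(sticky)                                # S_1^n = S_circ^n
    pulls, sums = np.zeros(K, int), np.zeros(K)
    phase_pulls = np.zeros(K, int)
    t = 0
    for j, end in enumerate(phase_ends, start=1):
        while t < end:                                  # phase j: A_{j-1} < t <= A_j
            t += 1
            k = max(active, key=lambda a: kl_ucb(       # play largest KL-UCB index
                sums[a] / pulls[a] if pulls[a] else 0.0, int(pulls[a]), t))
            x = rng.binomial(1, mu[k])
            pulls[k] += 1; sums[k] += x; phase_pulls[k] += 1
        m = int(np.argmax(phase_pulls))                 # most played arm M_j^n
        o = recommend(n, j)                             # recommendation O_j^n
        active = sticky | {m, o}                        # fast elimination step
        phase_pulls[:] = 0
    return pulls
```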

3 Theoretical Analysis and Regret Bound

We now present our asymptotically optimal regret bound for Algorithm 1.

Theorem 3.1

Suppose there exists \(C\ge 1\), \(\theta >0\) such that \(C^{-1} j^\theta \le A_{j}-A_{j-1} \le Cj^{\theta }\) for all \(j \in {{\,\mathrm{\mathbb {N}}\,}}\) and suppose that all agents select arms with Algorithm 1 with \(\alpha =1\). Then for each agent \(n \in [N]\), we have the asymptotic bound

$$\begin{aligned} \limsup _{T \rightarrow \infty } \frac{{{\,\mathrm{\mathbb {E}}\,}}[{\mathcal {R}}^n_T]}{\log T} \le \sum _{ k \in S_{\circ }^n\backslash \{\star \}}\frac{\mu _{\star }-\mu _k}{{{\,\mathrm{KL}\,}}(\mu _k,\mu _{\star })}. \end{aligned}$$

Let us consider the class of centralised algorithms \({\mathcal {A}}\) in which an arm \(I^n_t\) in [K] is selected for each agent \(n\in [N]\) and each time step \(t \in [T]\) based on the combined reward history of all the agents up to time t. We let \({\mathcal {A}}_{\mathrm {const}}\subseteq {\mathcal {A}}\) denote the subset of those which are consistent, i.e. achieve subpolynomial total regret \(\sum _n{{\,\mathrm{\mathbb {E}}\,}}[{\mathcal {R}}^n_T]\) for any instance of the multi-armed bandit problem. It follows from the result of Lai and Robbins (1) that for any algorithm in the class \({\mathcal {A}}_{\mathrm {const}}\),

$$\begin{aligned} \liminf _{T \rightarrow \infty }\frac{\sum _n{{\,\mathrm{\mathbb {E}}\,}}[{\mathcal {R}}^n_T]}{\log (T)} = \liminf _{T \rightarrow \infty } \left( \frac{\log (NT)}{\log (T)} \cdot \frac{\sum _n{{\,\mathrm{\mathbb {E}}\,}}[{\mathcal {R}}^n_T]}{\log (NT)}\right) \ge \sum _{i \ne \star } \frac{\mu _{\star } - \mu _i}{{{\,\mathrm{KL}\,}}(\mu _i, \mu _{\star })} . \end{aligned}$$
(3)

Now note that we can view the class \({\mathcal {A}}\) as the collection of all multi-agent algorithms, with or without communication constraints. In particular, the class of decentralised multi-agent algorithms with strong communication constraints considered in this paper corresponds to a computationally attractive subset of \({\mathcal {A}}\). Observe that by summing over \(n \in [N]\) in the regret bound given in Theorem 3.1, we see that the total regret of the system for our algorithm matches the lower bound given by (3) for the full communication setting. This implies that our algorithm (with limited communication) performs just as well as any algorithm, even one with access to unlimited communication resources, in the asymptotic regime.

In addition to Theorem 3.1, which only certifies the performance in the asymptotic regime, in Sect. 3.1, we continue the analysis of Algorithm 1 to derive a finite sample bound which has a clear dependence on the graph structure. Furthermore, in Sect. 4 we shall see that our algorithm also performs well empirically on a broad range of simulated data.

Before presenting the main proof of Theorem 3.1, we shall give a brief sketch. The argument hinges upon a random time \({\hat{\tau }}\) which corresponds to a phase after which all of the active sets \(S^n_j\) become fixed. After this time, all of the active sets become \(S^n_\circ \cup \{\star \}\), which leads to an asymptotic regret bound for agent n governed by the relationship between \(\mu _k\) and \(\mu _{\star }\) for \(k \in [K]\). The crucial difficulty is then to bound \({{\,\mathrm{\mathbb {E}}\,}}[A_{{{\hat{\tau }}}}]\), the expected time until the end of phase \({\hat{\tau }}\). To do so, we show that, provided the phase lengths \(A_j-A_{j-1}\) are sufficiently large in relation to the gap, the probability of a suboptimal arm being the most played, and subsequently being recommended, decays exponentially.

To bound the per agent expected regret of this system, we divide time into two parts: before \(A_{{{\hat{\tau }}}}\) and after \(A_{{{\hat{\tau }}}}\). The regret before time \(A_{{{\hat{\tau }}}}\) is trivially upper bounded by \({{\,\mathrm{\mathbb {E}}\,}}[A_{{{\hat{\tau }}}}]\), and since, after time \(A_{{{\hat{\tau }}}}\), the set of active arms for each agent remains fixed, the problem reduces to bounding the expected regret of a single-agent multi-armed bandit problem. For this, we follow the approach given in [12]: we show that, for late enough times, the KL-UCB of the optimal arm rarely falls far below its true mean, and additionally, the KL-UCB of each suboptimal arm rarely exceeds this level.

The proof of Theorem 3.15 (finite sample bound) is similar to that of Theorem 3.1, but the two differ when bounding \({{\,\mathrm{\mathbb {E}}\,}}[A_{{\hat{\tau }}}]\). The main difference is that Theorem 3.15 uses Lemma 3.13 instead of Lemma 3.8.

We now proceed with the proof itself, which goes through a sequence of lemmas. Let us begin by introducing some notation used throughout. Firstly, fix the exploration function \(f(t) := 1 + t \log ^2(t)\) (i.e. \(\alpha = 1\)). Next we define the suboptimality gap for each arm \(k \in [K]\) by

$$\begin{aligned} \Delta _k:= \mu _{\star }- \mu _k, \end{aligned}$$

and we define the smallest suboptimality gap,

$$\begin{aligned} \Delta _{\min } := \min _{k \in [K] \setminus \{\star \}}\Delta _k > 0. \end{aligned}$$

For each \(\epsilon \in (0,\Delta _{\min })\) and each agent \(n \in [N]\), we define a random variable

$$\begin{aligned} \kappa ^n_\epsilon := \min \left\{ t \in {{\,\mathrm{\mathbb {N}}\,}}: \max _{s \in [T]}\left( {\underline{d}}\left( {\hat{\mu }}_{\star ,s}^n,\mu _{\star }-{\epsilon }\right) -\frac{\log (f(t))}{s}\right) \le 0\right\} , \end{aligned}$$

where \({\underline{d}}(p,q):={{\,\mathrm{KL}\,}}(p,q)\cdot {{\,\mathrm{\mathbbm {1}}\,}}\{p \le q\}\). This random variable is the time after which the KL upper confidence bound of the optimal arm will not fall below \(\mu _{\star } - \epsilon \), no matter how many times the optimal arm has been played.

Next we define for every \(\epsilon \in (0,\Delta _{\min })\), for every agent \(n \in [N]\) and for every suboptimal arm \(k \in [K]\setminus \{\star \}\),

$$\begin{aligned} \nu _{\epsilon ,k}^n:= \sum _{s=1}^T {{\,\mathrm{\mathbbm {1}}\,}}\left\{ {{\,\mathrm{KL}\,}}({\hat{\mu }}^n_{k,s},\mu _{\star }-\epsilon )\le \frac{\log (f(T))}{s}\right\} . \end{aligned}$$

This random variable is an upper bound for the number of times the KL upper confidence bound of a suboptimal arm k exceeds \(\mu _{\star } - \epsilon \).

Together, these random variables allow us to bound the regret. This is because, after time \(\kappa _\epsilon ^n\), the number of times a suboptimal arm k is played is bounded above by \(\nu _{\epsilon ,k}^n\). To show this, we require the following lemmas (Lemmas 3.2, 3.3), which are effectively the same as [12, Lemma 10.7 & Lemma 10.8].

Lemma 3.2

For \({\epsilon } \in (0,\Delta _{\min })\), \(\max _{n \in [N]}{{\,\mathrm{\mathbb {E}}\,}}[\kappa ^n_\epsilon ] \le 2/{\epsilon }^2\).

Lemma 3.3

For \({\epsilon } \in (0,\Delta _{\min })\) and \(n \in [N]\), we have

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}[\nu _{\epsilon ,k}^n] \le \inf _{{\tilde{\epsilon }} \in (0,\Delta _k-\epsilon )} \left( \frac{\log f(T)}{{{\,\mathrm{KL}\,}}(\mu _k+{\tilde{\epsilon }},\mu _{\star }-\epsilon )}+\frac{1}{2{\tilde{\epsilon }}^2}\right) . \end{aligned}$$

To continue the proof, we define some further random variables that concern the optimal arm and its movement around the network.

Firstly, for each agent \(n \in [N]\) and each phase j we define a random variable

$$\begin{aligned} \chi _j^n:={{\,\mathrm{\mathbbm {1}}\,}}\{ \star \in S^n_j,~M^n_j \ne \star , A_{j-1}\ge \kappa ^n_{\circ } \}, \end{aligned}$$

where \(\kappa ^n_{\circ }:=\kappa ^n_{\Delta _{\min }/2}\). This variable indicates whether agent n has the best arm in its active set but has not played it the most during phase j (and therefore will not recommend it). Additionally, the condition \(A_{j - 1} \ge \kappa _\circ ^n\) demands that we are in a late enough phase, which is necessary for Lemma 3.7.

For each agent \(n \in [N]\), we define the following random variables:

$$\begin{aligned} {\hat{\tau }}^n_{\mathrm {stab}}&:=\min \{j \in {{\,\mathrm{\mathbb {N}}\,}}~:~ A_{j-1} \ge \kappa ^n_{\circ }, \forall j' \ge j,~\chi _{j'}^n=0\}\\ {\hat{\tau }}_{\mathrm {stab}}&:=\max _{n \in [N]}{\hat{\tau }}^n_{\mathrm {stab}}\\ {\hat{\tau }}^n_{\mathrm {spr}}&:=\min \{j \ge {\hat{\tau }}_{\mathrm {stab}}~:~\star \in S^n_j\}-{\hat{\tau }}_{\mathrm {stab}}\\ {\hat{\tau }}_{\mathrm {spr}}&:=\max _{n \in [N]}{\hat{\tau }}^n_{\mathrm {spr}}\\ {\hat{\tau }}&:={\hat{\tau }}_{\mathrm {stab}}+{\hat{\tau }}_{\mathrm {spr}}. \end{aligned}$$

These random variables highlight two key timings of the system (for each agent). The first is the stabilisation phase \({\hat{\tau }}_{\mathrm {stab}}^n\); this is the phase after which agent n will always recommend the best arm whenever it has the best arm. The second is the spreading time \({\hat{\tau }}^n_{\mathrm {spr}}\); this is the number of phases after \({\hat{\tau }}_{\mathrm {stab}}\) until agent n has the best arm in its active set for all subsequent phases. After phase \({{\hat{\tau }}}\), each agent will have the best arm and only recommend the best arm; therefore, the set of active arms for each agent is subsequently fixed. Lemma 3.4 proves this.

Lemma 3.4

For all phases \(j > {\hat{\tau }}\) and all \(n \in [N]\), we have \(S^n_j=S^n_\circ \cup \{\star \}\).

Proof

For each agent \(n \in [N]\), we see by induction that for any phase \(j \ge {\hat{\tau }}^n_{\mathrm {spr}}+{\hat{\tau }}_{\mathrm {stab}}\), we have that \(M^n_j=\star \in S^n_j\).

Moreover, since \(S_{j+1}^n=S_\circ ^n \cup \{M^n_j, M^Q_j\}\) for some agent Q in [N], it follows that \(S^n_{j+1}=S^n_\circ \cup \{\star \}\), for all \(j \ge {\hat{\tau }} = {\hat{\tau }}_{\mathrm {stab}}+{\hat{\tau }}_{\mathrm {spr}}\). \(\square \)

In the following lemma, we bound the number of times a suboptimal arm is played after the phase \({{\hat{\tau }}}\).

Lemma 3.5

For each agent \(n \in [N]\) and each suboptimal arm \(k \in [K] \backslash \{\star \}\), we have

$$\begin{aligned} \sum _{t=A_{{{\hat{\tau }}}}+1}^T{{\,\mathrm{\mathbbm {1}}\,}}\left\{ I^n_t = k\right\} \le {\left\{ \begin{array}{ll} \inf _{\epsilon \in (0,\Delta _{\min })} \left\{ \nu _{\epsilon ,k}^n+\kappa ^n_\epsilon \right\} &{}\text { if }k \in S^n_\circ \\ 0&{}\text { if }k \notin S^n_\circ . \end{array}\right. } \end{aligned}$$

Proof

Fix an agent \(n \in [N]\). First note that by Lemma 3.4 we have \(S^n_j= S^n_\circ \cup \{\star \}\) for all phases \(j >{{\hat{\tau }}}\). In particular, this means that \( I^n_t \notin S^n_\circ \cup \{\star \}\) cannot occur for \(t \ge A_{{{\hat{\tau }}}}+1\). Now take \(\epsilon \in (0,\Delta _{\min })\) and consider a suboptimal arm \(k \in S^n_\circ \backslash \{\star \}\). If \(I_t^n = k\) for some \(t \ge (A_{{{\hat{\tau }}}}+1)\vee \kappa ^n_\epsilon \), then we must have \(\mathrm {U}^n_k(t-1) \ge \mathrm {U}^n_{\star }(t-1)\ge \mu _{\star }-\epsilon \), and hence,

$$\begin{aligned} {{\,\mathrm{KL}\,}}({\hat{\mu }}^n_{k,{V^n_k(t-1)}},\mu _{\star }-\epsilon )\le \frac{\log (f(t))}{V^n_k(t-1)}\le \frac{\log (f(T))}{V^n_k(t-1)}. \end{aligned}$$

Consequently,

$$\begin{aligned} \sum _{t=(A_{{{\hat{\tau }}}}+1)\vee \kappa ^n_\epsilon }^T{{\,\mathrm{\mathbbm {1}}\,}}\left\{ I^n_t = k\right\} \le \sum _{t=(A_{{{\hat{\tau }}}}+1)\vee \kappa ^n_\epsilon }^T{{\,\mathrm{\mathbbm {1}}\,}}\left\{ I_t^n = k \text { and } {{\,\mathrm{KL}\,}}({\hat{\mu }}^n_{k,{V^n_k(t-1)}},\mu _{\star }-\epsilon )\le \frac{\log (f(T))}{V^n_k(t-1)} \right\} \le \nu _{\epsilon ,k}^n, \end{aligned}$$

and therefore,

$$\begin{aligned} \sum _{t=A_{{{\hat{\tau }}}}+1}^T{{\,\mathrm{\mathbbm {1}}\,}}\left\{ I^n_t = k\right\} \le \nu _{\epsilon , k}^n + \kappa _\epsilon ^n. \end{aligned}$$

The result then follows by taking an infimum over \(\epsilon \in (0,\Delta _{\min })\). \(\square \)

Combining Lemma 3.5 with Lemmas 3.2 and 3.3 (taking \({\tilde{\epsilon }}=\epsilon \), so that \(2/\epsilon ^2+1/(2\epsilon ^2)\le 3/\epsilon ^2\)), and bounding the regret up to time \(A_{{\hat{\tau }}}\) trivially by \(A_{{\hat{\tau }}}\), leads to the following regret bound.

Corollary 3.6

For each agent \(n \in [N]\), we have

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}[{\mathcal {R}}^n_T] \le {{\,\mathrm{\mathbb {E}}\,}}[A_{{\hat{\tau }}}]+\sum _{k \in S^n_\circ \backslash \{\star \}} \Delta _k \inf _{\epsilon \in \left( 0,\frac{\Delta _{\min }}{2}\right) }\left\{ \frac{\log f(T)}{{{\,\mathrm{KL}\,}}(\mu _k+{\epsilon },\mu _{\star }-\epsilon )}+\frac{3}{{\epsilon }^2}\right\} . \end{aligned}$$

For the remainder of the proof, we must show that \({{\,\mathrm{\mathbb {E}}\,}}[A_{{\hat{\tau }}}]\) may be bounded independently of T.

We do this as follows: In Lemma 3.7, we show that if the length of a phase is large enough, then the expected value of \(\chi _j^n\) decays exponentially with the phase length; in Lemmas 3.8 and 3.10, we find high probability bounds for \({\hat{\tau }}_{\mathrm {spr}}\) and \({\hat{\tau }}_{\mathrm {stab}}\), respectively; and we conclude in Proposition 3.11 by showing that \({{\,\mathrm{\mathbb {E}}\,}}[A_{{{\hat{\tau }}}}]\) is finite and does not depend on the time horizon T.

Lemma 3.7

For every phase \(j \in {{\,\mathrm{\mathbb {N}}\,}}\) such that \(A_j-A_{j-1} \ge \frac{8}{\Delta _{\min }^2}\left( \frac{K}{N}+3\right) {\log f(A_j)}\), we have

$$\begin{aligned}{{\,\mathrm{\mathbb {E}}\,}}[\chi ^n_j]\le \frac{8K}{\Delta _{\min }^2} \exp \left( -\frac{\Delta _{\min }^2 (A_j-A_{j-1})}{16 (K/N+3)}\right) .\end{aligned}$$

Proof

First observe that if \(\chi _j^n=1\) then \(\star \in S^n_j\), \(A_{j-1}\ge \kappa ^n_{\circ }\) and \(M^n_j \ne \star \). Since \(M^n_j \ne \star \), we deduce that for some \(k \in [K]\backslash \{\star \}\), we have

$$\begin{aligned} V_k^n(A_j)-V_k^n(A_{j-1}) \ge \frac{A_j-A_{j-1}}{|S^n_j|} \ge \frac{A_j-A_{j-1}}{K/N+3}, \end{aligned}$$

and so, for some \(A_{j-1}<t\le A_j\) we have \(s=V_k^n(t-1) \ge \frac{A_j-A_{j-1}}{K/N+3}-1\) and \(I_t^n=k\), so \(\mathrm {U}^n_{k}(t-1) \ge \mathrm {U}^n_{\star }(t-1)\) as \(\star \in S^n_j\). Since \(t> A_{j-1} \ge \kappa ^n_{\circ }\), we deduce that \(\mathrm {U}^n_{k}(t-1) \ge \mathrm {U}^n_{\star }(t-1) \ge \mu _{\star }-\Delta _{\min }/2\). Hence, by Pinsker’s inequality,

$$\begin{aligned} 2 \left( {\hat{\mu }}^{n}_{k,s}- \mu _\star +\frac{\Delta _{ \min }}{2}\right) ^2=2 \left( {\hat{\mu }}^{n}_{k}(t - 1)- \mu _\star +\frac{\Delta _{ \min }}{2}\right) ^2&\le {{\,\mathrm{KL}\,}}\left( {\hat{\mu }}^{n}_{k}(t - 1), \mu _{\star }-\frac{\Delta _{ \min }}{2}\right) \\&\le \frac{\log (f_\alpha (t))}{V_k^n(t - 1)} \le \frac{\log f(A_j)}{s}. \end{aligned}$$

Thus, for some \(k \in [K]\backslash \{\star \}\) and \(s \ge \frac{A_j-A_{j-1}}{K/N+3}-1\),

$$\begin{aligned} {\hat{\mu }}^{n}_{k,s}&\ge \mu _{\star }-\frac{\Delta _{\min }}{2}-\sqrt{\frac{\log f(A_j)}{2s}}\ge \mu _k+\frac{\Delta _{\min }}{2}-\sqrt{\frac{\log f(A_j)}{2s}}\ge \mu _k+\frac{\Delta _{\min }}{4}, \end{aligned}$$

since \(A_j-A_{j-1} \ge \frac{8}{\Delta _{\min }^2}\left( \frac{K}{N}+3\right) {\log f(A_j)}\). Thus, by Hoeffding’s inequality we have

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}[\chi ^n_j]&\le \sum _{k \in [K]\backslash \{\star \}}\sum _{s \ge \frac{A_j-A_{j-1}}{K/N+3}-1}{{\,\mathrm{\mathbb {P}}\,}}\left[ {\hat{\mu }}^{n}_{k,s} \ge \mu _k+\frac{\Delta _{\min }}{4}\right] \\&\le (K-1)\sum _{s \ge \frac{A_j-A_{j-1}}{K/N+3}-1} \exp \left( -\frac{s \Delta _{\min }^2}{8}\right) \\&\le K\int _{\frac{A_j-A_{j-1}}{K/N+3}-2}^{\infty } \exp \left( -\frac{s \Delta _{\min }^2}{8}\right) \mathrm {d}s\\&\le \frac{8K}{\Delta _{\min }^2} \exp \left( -\frac{\Delta _{\min }^2 (A_j-A_{j-1})}{16 (K/N+3)}\right) . \end{aligned}$$

\(\square \)

In what follows, we let \(p_{\min }:=\min \left( \{P(i,j)\}_{(i,j) \in [N]^2}\backslash \{0\}\right) \) and let \(\mathrm {diam}(P)\) denote the maximum, over ordered pairs of distinct nodes, of the length of the shortest directed path between them in the graph induced by the nonzero entries of P.
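For small illustrative networks, both quantities can be computed directly; the sketch below reads \(\mathrm {diam}(P)\) as the largest shortest-path length over ordered pairs, per the definition above.

```python
import numpy as np
from collections import deque

# Sketch: p_min and diam(P) for a probability matrix P (assumed strongly
# connected), following the definitions above.
def p_min(P):
    return P[P > 0].min()

def diam(P):
    N, best = len(P), 0
    for s in range(N):
        dist = {s: 0}
        queue = deque([s])
        while queue:                          # BFS along edges with P(i, j) > 0
            i = queue.popleft()
            for j in range(N):
                if P[i, j] > 0 and j not in dist:
                    dist[j] = dist[i] + 1
                    queue.append(j)
        best = max(best, max(dist.values()))
    return best
```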

Lemma 3.8

For \(\xi \in {{\,\mathrm{\mathbb {N}}\,}}\), \({{\,\mathrm{\mathbb {P}}\,}}({\hat{\tau }}_{\mathrm {spr}}\ge \xi ) \le N(1-p_{\min }^{\mathrm {diam}(P)})^{\left\lfloor \frac{\xi }{2\mathrm {diam}(P)}-1\right\rfloor }\).

Proof

Recall that \({\hat{\tau }}_{\mathrm {spr}}\) counts phases after \({\hat{\tau }}_{\mathrm {stab}}\), so we may assume that any agent holding the best arm will recommend it. Therefore, to upper bound this probability, we fix a path from an agent with the optimal arm (\(n_{\star }\)) to the chosen node n and bound the probability that, in every window of consecutive phases, some node in this path fails to request a recommendation from the preceding node, so that the best arm does not spread along this path.

Fix an agent \(n \in [N]\) and choose a sequence of nodes \((\ell _i)_{i \in [q]\cup \{0\}} \in [N]^{q+1}\) with \(q \le \mathrm {diam}(P)\) and such that \(\ell _0=n_{\star }\), \(\ell _q=n\) and \(P(\ell _{i},\ell _{i-1})>0\) for each \(i \in [q]\). Note that the definition of \(\mathrm {diam}(P)\) entails the existence of at least one such sequence. Recall that \(Q_j^{{\tilde{n}}}\) denotes the node which sends a message to agent \({\tilde{n}}\) at the end of phase j. Let \(m=\lfloor \xi /(2q)-1\rfloor \) and observe that if for some \(j_0\in \{{\hat{\tau }}_{\mathrm {stab}}, \ldots , {\hat{\tau }}_{\mathrm {stab}}+2mq\} \) we have \(Q_j^{\ell _{j-j_0}}=\ell _{j-j_0-1}\) for \(j \in \{j_0+1,\ldots ,j_0+q\}\), then \({\hat{\tau }}^n_{\mathrm {spr}}+{\hat{\tau }}_{\mathrm {stab}}\le j_0+q < \xi +{\hat{\tau }}_{\mathrm {stab}}\). Hence, we have

$$\begin{aligned} {{\,\mathrm{\mathbb {P}}\,}}({\hat{\tau }}_{\mathrm {spr}}^n \ge \xi )&\le {{\,\mathrm{\mathbb {P}}\,}}\left( \bigcap _{j_0 -{\hat{\tau }}_{\mathrm {stab}} \in \{0,1, \ldots , 2mq\}} \bigcup _{j \in \{j_0+1,\ldots ,j_0+q\}} \left\{ Q_j^{\ell _{j-j_0}}\ne \ell _{j-j_0-1} \right\} \right) \\&\le {{\,\mathrm{\mathbb {P}}\,}}\left( \bigcap _{j_0 -{\hat{\tau }}_{\mathrm {stab}} \in \{0,2q, \ldots , 2mq\}} \bigcup _{j \in \{j_0+1,\ldots ,j_0+q\}} \left\{ Q_j^{\ell _{j-j_0}}\ne \ell _{j-j_0-1} \right\} \right) \\&= \prod _{j_0 -{\hat{\tau }}_{\mathrm {stab}} \in \{0,2q, \ldots , 2mq\}} {{\,\mathrm{\mathbb {P}}\,}}\left( \bigcup _{j \in \{j_0+1,\ldots ,j_0+q\}} \left\{ Q_j^{\ell _{j-j_0}}\ne \ell _{j-j_0-1} \right\} \right) \\&= \prod _{j_0 -{\hat{\tau }}_{\mathrm {stab}} \in \{0,2q, \ldots , 2mq\}} \left\{ 1-{{\,\mathrm{\mathbb {P}}\,}}\left( \bigcap _{j \in \{j_0+1,\ldots ,j_0+q\}} \left\{ Q_j^{\ell _{j-j_0}} = \ell _{j-j_0-1} \right\} \right) \right\} \\&= \prod _{j_0 -{\hat{\tau }}_{\mathrm {stab}} \in \{0,2q, \ldots , 2mq\}} \left\{ 1-\prod _{j \in \{j_0+1,\ldots ,j_0+q\}} {{\,\mathrm{\mathbb {P}}\,}}\left( Q_j^{\ell _{j-j_0}} = \ell _{j-j_0-1} \right) \right\} \\&\le (1-p_{\min }^q)^m \le (1-p_{\min }^{\mathrm {diam}(P)})^{\left\lfloor \frac{\xi }{2\mathrm {diam}(P)}-1\right\rfloor }. \end{aligned}$$

The lemma now follows by the union bound over [N]. \(\square \)

The following lemma gives us a bound for the time at which each phase starts (and ends) using the phase lengths.

Lemma 3.9

Suppose that there exist \(C\ge 1\), \(\theta >0\) such that \(C^{-1} j^\theta \le A_{j}-A_{j-1} \le Cj^{\theta }\) for all \(j \in {{\,\mathrm{\mathbb {N}}\,}}\). Then we have \(\frac{C^{-1}}{1 + \theta } j^{1+\theta } \le A_j \le \frac{C}{1 + \theta }(1+j)^{1+\theta }\) for all \(j \in {{\,\mathrm{\mathbb {N}}\,}}\).

Proof

We have that

$$\begin{aligned} A_j = \sum _{i=1}^{j}(A_i - A_{i-1}) \end{aligned}$$

since \(A_0 := 0\). Therefore,

$$\begin{aligned} C^{-1}\sum _{i=1}^{j}i^\theta \le A_j \le C\sum _{i=1}^{j}i^\theta . \end{aligned}$$

Since \(i^\theta \) is increasing in i, we can bound the sums as follows:

$$\begin{aligned} C^{-1}\int _{0}^{j}i^\theta \mathrm {d}i \le A_j \le C\int _{0}^{j}(i+1)^\theta \mathrm {d}i. \end{aligned}$$

And this gives the desired result:

$$\begin{aligned} \frac{C^{-1}}{1 + \theta } j^{1+\theta } \le A_j \le \frac{C}{1 + \theta }(1+j)^{1+\theta }. \end{aligned}$$

\(\square \)

Now define the phase \({\underline{j}}(\Delta _{\min }) \in {{\,\mathrm{\mathbb {N}}\,}}\) by

$$\begin{aligned} {\underline{j}}(\Delta _{\min }):=4+\max \left( \{0\}\cup \left\{ j \in {{\,\mathrm{\mathbb {N}}\,}}~:~ j^{\theta } < \frac{8C(1 + \theta )}{\Delta _{\min }^2}\left( \frac{K}{N}+3\right) \log f\left( \frac{C}{1 + \theta }(1+j)^{1+\theta }\right) \right\} \right) . \end{aligned}$$

Note that \({\underline{j}}(\Delta _{\min })\) is always finite since \(\log f(t)=O(\log t)\) while \(j^\theta \) grows polynomially. This phase is defined, via Lemma 3.9, precisely so that Lemma 3.7 can be applied in the proof of Lemma 3.10.

Lemma 3.10

Suppose that there exist constants \(C\ge 1\), \(\theta >0\) such that \(C^{-1} j^\theta \le A_{j}-A_{j-1} \le Cj^{\theta }\) for all \(j \in {{\,\mathrm{\mathbb {N}}\,}}\). Then for all \(\xi \ge {\underline{j}}(\Delta _{\min })\) we have

$$\begin{aligned} {{\,\mathrm{\mathbb {P}}\,}}( {\hat{\tau }}_{\mathrm {stab}} \ge \xi ) \le \sum _{n \in [N]}{{\,\mathrm{\mathbb {P}}\,}}\left( \kappa ^n_{\circ }> \frac{C^{-1}}{1+\theta }(\xi -2)^{1+\theta }\right) +\frac{8KN}{\Delta _{\min }^2} \sum _{j \ge \xi -1}\exp \left( -\frac{\Delta _{\min }^2j^\theta }{16C(K/N+3)}\right) . \end{aligned}$$

Proof

Fix an agent \(n \in [N]\) and suppose that \({\hat{\tau }}^n_{\mathrm {stab}} \ge \xi \). Since \({\hat{\tau }}^n_{\mathrm {stab}}:=\min \{j \in {{\,\mathrm{\mathbb {N}}\,}}~:~ A_{j-1} \ge \kappa ^n_{\circ }, \forall j' \ge j,~\chi _{j'}^n=0\}\) it follows that either \(A_{\xi -2} < \kappa ^n_{\circ }\) or \(\chi _{j}^n=1\) for some \(j \ge \xi - 1\). Note also that by the upper bound in Lemma 3.9 for \(j \ge \xi \ge {\underline{j}}(\Delta _{\min })\) we have

$$\begin{aligned} A_j-A_{j-1}&\ge C^{-1} j^\theta \ge \frac{8}{\Delta _{\min }^2}\left( \frac{K}{N}+3\right) \log f\left( \frac{C}{1 + \theta }(1+j)^{1+\theta }\right) \\&\ge \frac{8}{\Delta ^2_{\min }}\left( \frac{K}{N}+3\right) {\log f(A_j)}. \end{aligned}$$

Hence, by Lemma 3.7 and the lower bound in Lemma 3.9, we have

$$\begin{aligned} {{\,\mathrm{\mathbb {P}}\,}}( {\hat{\tau }}^n_{\mathrm {stab}} \ge \xi )&\le {{\,\mathrm{\mathbb {P}}\,}}( A_{\xi -2} < \kappa ^n_{\circ })+\sum _{j \ge \xi - 1}{{\,\mathrm{\mathbb {E}}\,}}[\chi ^n_j]\\&\le {{\,\mathrm{\mathbb {P}}\,}}(\kappa ^n_{\circ }> \frac{C^{-1}}{1 + \theta }(\xi -2)^{1+\theta })+\frac{8K}{\Delta _{\min }^2} \sum _{j \ge \xi - 1}\exp \left( -\frac{\Delta _{\min }^2 (A_j-A_{j-1})}{16 (K/N+3)}\right) \\&\le {{\,\mathrm{\mathbb {P}}\,}}(\kappa ^n_{\circ }> \frac{C^{-1}}{1 + \theta }(\xi -2)^{1+\theta })+\frac{8K}{\Delta _{\min }^2} \sum _{j \ge \xi - 1}\exp \left( -\frac{\Delta _{\min }^2j^\theta }{16C(K/N+3)}\right) .\\ \end{aligned}$$

Once again, the conclusion of the lemma follows by a union bound over \(n \in [N]\). \(\square \)

Proposition 3.11

Suppose that there exist \(C\ge 1\), \(\theta >0\) such that \(C^{-1} j^\theta \le A_{j}-A_{j-1} \le Cj^{\theta }\) for all \(j \in {{\,\mathrm{\mathbb {N}}\,}}\). Then there exists a constant \(\phi \equiv \phi (\Delta _{\min },C,\theta ,N,K,p_{\min },\mathrm {diam}(P))\) depending on \(\Delta _{\min },C,\theta ,N,K,p_{\min },\mathrm {diam}(P)\) but not on T such that \({{\,\mathrm{\mathbb {E}}\,}}[A_{{\hat{\tau }}}] \le \phi \).

Proof

If \(A_{{{\hat{\tau }}}} \ge \zeta \ge \frac{C}{1+\theta }(1+2{\underline{j}}(\Delta _{\min }))^{1+\theta } \vee \frac{C}{1 + \theta } \cdot \{16 \mathrm {diam}(P)\}^{1+\theta }\), then \({\hat{\tau }} \ge (\frac{\zeta (1 + \theta )}{C})^{\frac{1}{1+\theta }}-1\), so \({\hat{\tau }}_{\mathrm {spr}} \vee {\hat{\tau }}_{\mathrm {stab}} \ge \{(\frac{\zeta (1 + \theta )}{C})^{\frac{1}{1+\theta }}-1\}/2\ge {\underline{j}}(\Delta _{\min })\). Hence, for \(\zeta \ge \psi \equiv \psi (\Delta _{\min },C,\theta ,\mathrm {diam}(P)):= \frac{C}{1 + \theta }(1+2{\underline{j}}(\Delta _{\min }))^{1+\theta }\vee \frac{C}{1 +\theta } \{16 \mathrm {diam}(P)\}^{1+\theta }\),

$$\begin{aligned} {{\,\mathrm{\mathbb {P}}\,}}(A_{{{\hat{\tau }}}} \ge \zeta )&\le {{\,\mathrm{\mathbb {P}}\,}}\left( {\hat{\tau }}_{\mathrm {spr}} \ge \frac{1}{2}\{\left( \frac{\zeta \left( 1 + \theta \right) }{C}\right) ^{\frac{1}{1+\theta }}-1\}\right) +{{\,\mathrm{\mathbb {P}}\,}}\left( {\hat{\tau }}_{\mathrm {stab}} \ge \frac{1}{2}\{\left( \frac{\zeta (1 + \theta )}{C}\right) ^{\frac{1}{1+\theta }}-1\}\right) \\&\le N(1-p_{\min }^{\mathrm {diam}(P)})^{\big \lfloor \frac{(\zeta (1 + \theta )/C)^{\frac{1}{1+\theta }}}{4\mathrm {diam}(P)}-2\big \rfloor } \\&\quad +\frac{8KN}{\Delta _{\min }^2} \int _{z \ge (\zeta (1 + \theta )/C)^{\frac{1}{1+\theta }}/2-3}\exp \left( -\frac{\Delta _{\min }^2z^\theta }{16C(K/N+3)}\right) \mathrm {d}z\\&\quad +\sum _{n \in [N]}{{\,\mathrm{\mathbb {P}}\,}}\left( \kappa ^n_{\circ }> \{(\zeta (1+ \theta )/C)^{\frac{1}{1+\theta }}/2-4\}^{1+\theta }/(C(1 + \theta ))\right) \\&\le N(1-p_{\min }^{\mathrm {diam}(P)})^{\frac{(\zeta (1+\theta )/C)^{\frac{1}{1+\theta }}}{2^4\mathrm {diam}(P)}}\\&\quad +\frac{8KN}{\Delta _{\min }^2} \int _{z \ge (\zeta (1 + \theta )/C)^{\frac{1}{1+\theta }}/2-3}\exp \left( -\frac{\Delta _{\min }^2z^\theta }{16C(K/N+3)}\right) \mathrm {d}z\\&\quad +\sum _{n \in [N]}{{\,\mathrm{\mathbb {P}}\,}}(\kappa ^n_{\circ }> (2^{1+\theta }C)^{-2} \cdot \zeta ). \end{aligned}$$

And by Lemma 3.2, we have that

$$\begin{aligned} \sum _{n \in [N]}\sum _{\zeta \in {{\,\mathrm{\mathbb {N}}\,}}}{{\,\mathrm{\mathbb {P}}\,}}(\kappa ^n_{\circ }> (2^{1+\theta }C)^{-2} \cdot \zeta )&= \sum _{n \in [N]}\sum _{\zeta \in {{\,\mathrm{\mathbb {N}}\,}}}{{\,\mathrm{\mathbb {P}}\,}}((2^{1+\theta }C)^{2}\kappa ^n_{\circ }> \zeta ) \\&\le (2^{1+\theta }C)^{2}\sum _{n \in [N]} {{\,\mathrm{\mathbb {E}}\,}}[\kappa ^n_\circ ] \le (2^{1 + \theta } C)^2\frac{8N}{\Delta _{\min }^2}. \end{aligned}$$

Therefore,

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}[A_{{{\hat{\tau }}}}]&\le \psi + \sum _{\zeta> \psi } \left\{ N(1-p_{\min }^{\mathrm {diam}(P)})^{\frac{(\zeta (1 + \theta )/C)^{\frac{1}{1+\theta }}}{2^4\mathrm {diam}(P)}}\right\} \\&\quad + \sum _{\zeta > \psi }\left\{ \frac{8KN}{\Delta _{\min }^2} \int _{z \ge (\zeta (1 + \theta )/C)^{\frac{1}{1+\theta }}/2-3}\exp \left( -\frac{\Delta _{\min }^2z^\theta }{16C(K/N+3)}\right) \mathrm {d}z\right\} \\&\quad + (2^{1 + \theta } C)^2\frac{8N}{\Delta _{\min }^2}\equiv \phi < \infty , \end{aligned}$$

where \(\phi \) is a constant not depending on the time horizon T. \(\square \)

Theorem 3.1 follows from Corollary 3.6 combined with Proposition 3.11, dividing by \(\log T\), letting \(T \rightarrow \infty \) and taking \(\epsilon \rightarrow 0\).
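To spell out the final step, Corollary 3.6 and Proposition 3.11 give, for any fixed \(\epsilon \in (0,\Delta _{\min }/2)\),

$$\begin{aligned} \limsup _{T \rightarrow \infty } \frac{{{\,\mathrm{\mathbb {E}}\,}}[{\mathcal {R}}^n_T]}{\log T} \le \limsup _{T \rightarrow \infty }\frac{\phi }{\log T}+\sum _{k \in S^n_\circ \backslash \{\star \}} \frac{\Delta _k}{{{\,\mathrm{KL}\,}}(\mu _k+\epsilon ,\mu _{\star }-\epsilon )}\cdot \limsup _{T \rightarrow \infty }\frac{\log f(T)}{\log T} = \sum _{k \in S^n_\circ \backslash \{\star \}} \frac{\Delta _k}{{{\,\mathrm{KL}\,}}(\mu _k+\epsilon ,\mu _{\star }-\epsilon )}, \end{aligned}$$

since \(\log f(T) = \log (1+T\log ^2 T) = \log T + o(\log T)\); letting \(\epsilon \rightarrow 0\) and using the continuity of \({{\,\mathrm{KL}\,}}\) yields the stated bound.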

3.1 Finite Sample Bound

To derive a finite sample bound, we will make two additional assumptions. Firstly, we will assume that the neighbour that a recommendation is pulled from is chosen uniformly at random (Definition 3.12). This will allow for a tighter bound on \({\hat{\tau }}_{\mathrm {spr}}\), which will depend on the conductance of the graph and the degrees of its nodes. We will also assume that the phase lengths grow such that \(C^{-1}j^\theta \le A_j - A_{j-1} \le C j^\theta \) where \(C \ge 1\), \(\theta > 1\). We now define the required graph properties.

The degree of a node \(n \in [N]\) is defined by

$$\begin{aligned} d_n:=\sum _{j \in [N]}{{\,\mathrm{\mathbbm {1}}\,}}\{P(n,j)\ne 0\}. \end{aligned}$$

The conductance \(\phi \) of P is defined as

$$\begin{aligned} \phi (P) := \min _{S \subset [N], S \ne \emptyset }\frac{\sum _{i \in S, j \in S^c}P(i, j)}{\frac{1}{N} |S|\cdot |S^c|}. \end{aligned}$$

Definition 3.12

We say that P satisfies the uniform-pull condition if for each \(i \in [N]\), with \(d_i=\sum _{j \in [N]}{{\,\mathrm{\mathbbm {1}}\,}}\{P(i,j)\ne 0\}\) we have \(P(i,j) \in \{0,1/d_i\}\) for all \(j \in [N]\).
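The following sketch evaluates these graph quantities for a given matrix P; the brute-force minimum over subsets is exponential in N and is intended only for small illustrative examples.

```python
import numpy as np
from itertools import combinations

# Sketch: degrees, conductance, and the uniform-pull check for a matrix P,
# following the definitions above.
def degrees(P):
    return (P > 0).sum(axis=1)

def conductance(P):
    N, best = len(P), float("inf")
    for r in range(1, N):
        for S in combinations(range(N), r):
            Sc = [j for j in range(N) if j not in S]
            flow = sum(P[i, j] for i in S for j in Sc)
            best = min(best, flow / (len(S) * len(Sc) / N))
    return best

def uniform_pull(P):
    d, N = degrees(P), len(P)
    return all(np.isclose(P[i, j], 0) or np.isclose(P[i, j], 1 / d[i])
               for i in range(N) for j in range(N))
```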

We will bound \({\hat{\tau }}_{\mathrm {spr}}\) using the fact that it is stochastically dominated by a random variable \(\tau _{\mathrm {spr}}\) representing the time it takes for a rumour to spread on a graph according to a PULL model where neighbours are chosen uniformly at random. The following result from [7, Lemma 4] gives us a bound on this rumour spreading time \(\tau _{\mathrm {spr}}\).

Lemma 3.13

(Giakkoupis, 2011) For all \(\beta > 0\), we have that

$$\begin{aligned} {{\,\mathrm{\mathbb {P}}\,}}\left( \tau _{\mathrm {spr}} > 50 (\beta + 2) \log N \left( \phi ^{-1} + \frac{d_{\mathrm {max}}}{\lceil \phi \cdot d_{n_{\star }}\rceil }\right) \right) \le 3 N^{-\beta }, \end{aligned}$$

where \(d_{\mathrm {max}} := \max _{n \in [N]}d_n\) is the maximal degree of the graph and \(n_{\star }\) is the agent for which \(\star \in S_{\circ }^{n_{\star }}\).

The following lemma, similar to Proposition 3.11, bounds \({{\,\mathrm{\mathbb {E}}\,}}[A_{{\hat{\tau }}}]\) from above by an expression independent of the time horizon. Owing to the additional assumptions we have made, we are able to obtain a more explicit bound.

Lemma 3.14

Suppose that P satisfies the uniform-pull condition (Definition 3.12). Suppose further that there exist \(C\ge 1\), \(\theta >1\) such that \(C^{-1} j^\theta \le A_{j}-A_{j-1} \le Cj^{\theta }\) for all \(j \in {{\,\mathrm{\mathbb {N}}\,}}\). Then we have that

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}[A_{\hat{\tau }}]&\le \frac{C}{1 + \theta }(1+2{\underline{j}}(\Delta _{\min }))^{1+\theta }+ (2^{1 + \theta } C)^2\frac{8N}{\Delta _{\min }^2} + \frac{\lceil \theta \rceil !128 C^2 K 64^\theta (K/N + 3)^{\theta + 1}}{\Delta ^{4 + 2\theta }_{\mathrm {min}}}\\&\quad +\, 3C \lceil \theta \rceil !\left( 400\left( \phi ^{-1} + \frac{d_{\mathrm {max}}}{\lceil \phi \cdot d_{n_{\star }}\rceil }\right) \right) ^{\theta }. \end{aligned}$$

Proof

If \(A_{{{\hat{\tau }}}} \ge \zeta \ge \frac{C}{1+\theta }(1+2{\underline{j}}(\Delta _{\min }))^{1+\theta }\), then \({\hat{\tau }} \ge (\frac{\zeta (1 + \theta )}{C})^{\frac{1}{1+\theta }}-1\), so \({\hat{\tau }}_{\mathrm {spr}} \vee {\hat{\tau }}_{\mathrm {stab}} \ge \{(\frac{\zeta (1 + \theta )}{C})^{\frac{1}{1+\theta }}-1\}/2\ge {\underline{j}}(\Delta _{\min })\). Hence, for \(\zeta \ge \psi \equiv \psi (\Delta _{\min },C,\theta ):= \frac{C}{1 + \theta }(1+2{\underline{j}}(\Delta _{\min }))^{1+\theta }\), and by the same approach as Proposition 3.11, we arrive at

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}[A_{{{\hat{\tau }}}}]&\le \frac{C}{1 + \theta }(1+2{\underline{j}}(\Delta _{\min }))^{1+\theta }\\&\quad +(2^{1 + \theta } C)^2\frac{8N}{\Delta _{\min }^2}\\&\quad + \sum _{\zeta> \psi } \left\{ {{\,\mathrm{\mathbb {P}}\,}}\left( {\hat{\tau }}_{\mathrm {spr}} \ge \frac{1}{2}\left\{ \left( \frac{\zeta \left( 1 + \theta \right) }{C}\right) ^{\frac{1}{1+\theta }} - 1\right\} \right) \right\} \\&\quad + \sum _{\zeta > \psi }\left\{ \frac{8K}{\Delta _{\min }^2} \int _{z \ge (\zeta (1 + \theta )/C)^{\frac{1}{1+\theta }}/2-3}\exp \left( -\frac{\Delta _{\min }^2z^\theta }{16C(K/N+3)}\right) \mathrm {d}z\right\} . \end{aligned}$$

It suffices to bound the third and fourth terms. Firstly, we will bound the third term using Lemma 3.13. For notational convenience, define

$$\begin{aligned} \Omega := 100 \log N \left( \phi ^{-1} + \frac{d_{\max }}{\lceil \phi d_{n_{\star }}\rceil }\right) . \end{aligned}$$

We have that

$$\begin{aligned} \frac{1}{3} \cdot \sum _{\zeta> \psi } {{\,\mathrm{\mathbb {P}}\,}}\left( {\hat{\tau }}_{\mathrm {spr}} \ge \frac{1}{4}\left( \frac{\zeta \left( 1 + \theta \right) }{C}\right) ^{\frac{1}{1+\theta }}\right)&\le \frac{1}{3} \cdot \sum _{\zeta> \psi } {{\,\mathrm{\mathbb {P}}\,}}\left( \tau _{\mathrm {spr}} \ge \frac{1}{4}\left( \frac{\zeta \left( 1 + \theta \right) }{C}\right) ^{\frac{1}{1+\theta }}\right) \\&\le \sum _{\zeta > \psi }N^{-\frac{1}{4\Omega }\left( \frac{\zeta \left( 1 + \theta \right) }{C}\right) ^{\frac{1}{1+\theta }}}\\&\le \int _\psi ^\infty N^{-\frac{1}{4\Omega }\left( \frac{\zeta \left( 1 + \theta \right) }{C}\right) ^{\frac{1}{1+\theta }}}\mathrm {d}\zeta \\&= \int _\psi ^\infty \exp \left( {-\frac{\log N}{4\Omega }\left( \frac{\zeta \left( 1 + \theta \right) }{C}\right) ^{\frac{1}{1+\theta }}}\right) \mathrm {d}\zeta \\&\le C \left( \frac{4\Omega }{\log N}\right) ^{\theta } \int _0^\infty x^{\theta } \exp (-x)\mathrm {d}x\\&= C \left( \frac{4\Omega }{\log N}\right) ^{\theta } \Gamma (\theta +1)\\&\le C \Gamma (\theta +1)\left( 400\left( \phi ^{-1} + \frac{d_{\mathrm {max}}}{\lceil \phi \cdot d_{n_{\star }}\rceil }\right) \right) ^{\theta }\\&\le C \lceil \theta \rceil !\left( 400\left( \phi ^{-1} + \frac{d_{\mathrm {max}}}{\lceil \phi \cdot d_{n_{\star }}\rceil }\right) \right) ^{\theta }, \end{aligned}$$

where the first inequality holds since \({\hat{\tau }}_{\mathrm {spr}}\) is stochastically dominated by \(\tau _{\mathrm {spr}}\), the second inequality holds from Lemma 3.13 and the fifth inequality holds from a change of variables.

We will now bound the fourth term. We start by bounding the integral

$$\begin{aligned} \int _{\frac{1}{2}(\frac{\zeta (1 + \theta )}{C})^{\frac{1}{1+\theta }} -3 }^\infty \exp \left( -\frac{\Delta _{\min }^2z^\theta }{16C(K/N+3)}\right) \mathrm {d}z&\le \int _{\frac{1}{2}(\frac{\zeta (1 + \theta )}{C})^{\frac{1}{1+\theta }} -3 }^\infty \exp \left( -\frac{\Delta _{\min }^2z}{16C(K/N+3)}\right) \mathrm {d}z \\&\le \frac{16C(K/N + 3)}{\Delta _{\mathrm {min}}^2}\exp \left( -\frac{\Delta _{\min }^2}{16C(K/N+3)}\left\{ \frac{1}{2}\left( \frac{\zeta (1 + \theta )}{C}\right) ^{\frac{1}{1+\theta }} -3\right\} \right) \\&\le \frac{16C(K/N + 3)}{\Delta _{\mathrm {min}}^2}\exp \left( -\frac{\Delta _{\min }^2}{64C(K/N+3)}\left( \frac{\zeta (1 + \theta )}{C}\right) ^{\frac{1}{1+\theta }}\right) . \end{aligned}$$

And therefore, we have

$$\begin{aligned} \sum _{\zeta> \psi }&\left\{ \frac{8K}{\Delta _{\min }^2} \int _{\frac{1}{2}(\frac{\zeta (1 + \theta )}{C})^{\frac{1}{1+\theta }} -3 }^\infty \exp \left( -\frac{\Delta _{\min }^2z^\theta }{16C(K/N+3)}\right) \mathrm {d}z\right\} \\&\le \frac{128 C K(K/N + 3)}{\Delta ^4_{\mathrm {min}}}\sum _{\zeta > \psi } \exp \left( -\frac{\Delta _{\min }^2}{64C(K/N+3)}\left( \frac{\zeta (1 + \theta )}{C}\right) ^{\frac{1}{1+\theta }}\right) \\&\le \frac{128 C K(K/N + 3)}{\Delta ^4_{\mathrm {min}}}\int _{\psi }^\infty \exp \left( -\frac{\Delta _{\min }^2}{64C(K/N+3)}\left( \frac{\zeta (1 + \theta )}{C}\right) ^{\frac{1}{1+\theta }}\right) \mathrm {d}\zeta \\&\le \frac{128 C^2 K 64^\theta (K/N + 3)^{\theta + 1}}{\Delta ^{4 + 2\theta }_{\mathrm {min}}}\int _{0}^\infty x^\theta \exp (-x) \mathrm {d}x\\&\le \frac{\lceil \theta \rceil !128 C^2 K 64^\theta (K/N + 3)^{\theta + 1}}{\Delta ^{4 + 2\theta }_{\mathrm {min}}}. \end{aligned}$$

\(\square \)

The following result is a direct consequence of Corollary 3.6 and Lemma 3.14.

Theorem 3.15

Suppose that there exist \(C\ge 1\), \(\theta >1\) such that \(C^{-1} j^\theta \le A_{j}-A_{j-1} \le Cj^{\theta }\) for all \(j \in {{\,\mathrm{\mathbb {N}}\,}}\). Then for each agent \(n \in [N]\), we have the following regret bound

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}[{\mathcal {R}}^n_T]&\le \frac{C}{1 + \theta }(1+2{\underline{j}}(\Delta _{\min }))^{1+\theta }+ (2^{1 + \theta } C)^2\frac{8N}{\Delta _{\min }^2} + \frac{\lceil \theta \rceil !128 C^2 K 64^\theta (K/N + 3)^{\theta + 1}}{\Delta ^{4 + 2\theta }_{\mathrm {min}}}\\&\quad + \underbrace{3C \lceil \theta \rceil !\left( 400\left( \phi ^{-1} + \frac{d_{\mathrm {max}}}{\lceil \phi \cdot d_{n_{\star }}\rceil }\right) \right) ^{\theta }}_{\text {Impact from graph}}\\&\quad +\sum _{k \in S^n_\circ \backslash \{\star \}} \Delta _k \inf _{\epsilon \in \left( 0,\frac{\Delta _{\min }}{2}\right) }\left\{ \frac{\log f(T)}{{{\,\mathrm{KL}\,}}(\mu _k+{\epsilon },\mu _{\star }-\epsilon )}+\frac{3}{{\epsilon }^2}\right\} . \end{aligned}$$

This bound provides insight into the effect the initial phases (before \(A_{{\hat{\tau }}}\)) might have on the regret. We observe that this bound is large when either \(\Delta _{\mathrm {min}}\) or the conductance \(\phi \) is small, or when either \(\theta \) or the ratio \(\frac{d_{\mathrm {max}}}{d_{n_{\star }}}\) is large. Additionally, \(\theta \) amplifies the effect of these parameters on the regret bound.

Figure 3 shows the results of a series of simulations using different networks with Algorithm 1. These simulations rank the regret for the different graphs, from lowest to highest, as complete, cycle and star. The Impact from graph term in Theorem 3.15 can help explain these results.

Firstly, the complete graph has conductance \(\phi = \frac{N}{2(N - 1)}\) so the impact of the complete graph on the regret bound is

$$\begin{aligned} 3C \lceil \theta \rceil !\left( 400\left( 4\right) \right) ^{\theta }. \end{aligned}$$

For the cycle graph, the conductance is \(\phi = 2/N\) so the graph impact is

$$\begin{aligned} 3C \lceil \theta \rceil !\left( 400\left( N/4\right) \right) ^{\theta }. \end{aligned}$$

Finally, the impact of the star graph depends on whether the best arm begins on the central node or a leaf node. Since the conductance of the star graph is \(\phi = \frac{N - 1}{N}\), we have the following scaling in each case:

$$\begin{aligned} \underbrace{3C \lceil \theta \rceil !\left( 400\cdot 2\right) ^{\theta }}_{\text {Central Node}} \qquad \text {and} \qquad \underbrace{3C \lceil \theta \rceil !\left( 400(N-1)\right) ^{\theta }}_{\text {Leaf Node}}. \end{aligned}$$
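As a quick numerical check of these scalings, the sketch below evaluates the factor \(\phi ^{-1} + d_{\mathrm {max}}/\lceil \phi \, d_{n_{\star }}\rceil \) for the three topologies, using the conductance values quoted above; the network size N is an illustrative assumption.

```python
import math

# Sketch: the "Impact from graph" factor phi^{-1} + d_max / ceil(phi * d_star)
# for the three topologies discussed above.
def impact(phi, d_max, d_star):
    return 1 / phi + d_max / math.ceil(phi * d_star)

N = 20                                                            # illustrative
print("complete:     ", impact(N / (2 * (N - 1)), N - 1, N - 1))  # ~ 4
print("cycle:        ", impact(2 / N, 2, 2))                      # grows with N
print("star (centre):", impact((N - 1) / N, N - 1, N - 1))        # ~ 2
print("star (leaf):  ", impact((N - 1) / N, N - 1, 1))            # ~ N
```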

4 Numerical Results

Here we will compare Algorithm 1 and the GosInE algorithm on a range of synthetic data. We compare variants of both of these algorithms using Hoeffding and KL upper confidence bounds. For GosInE, the Hoeffding and KL variants are, respectively, labelled UCB-GIE and KLUCB-GIE, and for Algorithm 1, they are labelled GIE-FE (Gossip-Insert-Eliminate with Fast Elimination) and AOGB.

The experiments are conducted in two settings: \((N, K) = (20, 50)\) and \((N, K) = (10, 100)\). Each experiment consists of 100 independent runs, and in each run the regret is averaged over the nodes. In each experiment, the algorithms encounter the same reward sequence. The first two experiments assume the agents are connected via a complete graph, while the third experiment compares different graphs. We compute the regret over a time horizon of \(T = 100{,}000\) and plot the sample mean along with 95% confidence intervals. Other than in Fig. 4, we take the phase lengths to grow cubically, i.e. \(A_j = j^3\), and other than in Fig. 3, we assume that agents are connected via a complete graph.

Choice of \(\alpha \): We begin by comparing Algorithm 1 and GosInE for the two different types of upper confidence bounds, varying the exploration function \(f_\alpha (t) = 1 + t^\alpha \log ^2(t)\) through different values of \(\alpha \). From Fig. 1, we see that both Algorithm 1 and GosInE perform better when equipped with KL upper confidence bounds. Additionally, Algorithm 1 outperforms GosInE when both are equipped with the same upper confidence bounds. Overall, performance is better for smaller values of \(\alpha \), and regret is minimised somewhere in the region \(\alpha \le 1\). This suggests that there may be more practical choices of \(f_\alpha (t)\) than the asymptotically optimal choice \(\alpha = 1\).

Fig. 1

Regret for different choices of \(\alpha \) with \(\mu _{\star } = 0.9\); the means of the remaining arms divide the interval [0.2, 0.8] uniformly

\(\Delta _{\min }\) vs Regret: Now we consider the effect of changing the suboptimality gap \(\Delta _{\min }\). This is the difference between the means of the best and the second-best arms. Figure 2 compares Algorithm 1 and the GosInE algorithm for both types of confidence intervals. Similarly to the previous experiment, we observe that both algorithms perform better when equipped with the KL upper confidence bounds and that Algorithm 1 typically outperforms GosInE on average when they are equipped with the same upper confidence bounds.

Fig. 2

Regret for different choices of \(\Delta _{\min }\) with \(\alpha = 1\). The best arm has mean \(\mu _\star = 0.9\) and the means of the remaining arms divide the interval \([0.2, 0.9 - \Delta _{\min }]\) uniformly

Network Configurations Here we compare three different network configurations for agents implementing Algorithm 1: a complete graph, a cycle graph and a star graph.

Fig. 3

Regret over time for three different networks. In each case we consider \(\alpha = 1\), \(\Delta _{\min } = 0.1\) and the means of the remaining arms divide the interval [0.2, 0.8] uniformly

The results in Fig. 3 show that the cycle graph performs slightly worse than the complete graph, while the star graph struggles significantly and exhibits larger variance. In essence, this is because the best arm needs to spread to the centre of the star before it can spread to all of the other nodes.

Phase Lengths

In Fig. 4, we see the effect of changing the communication rounds \(A_j\) on the regret. We consider three different polynomial functions, \(A_j = j^2, j^3, j^4\), as these satisfy the assumptions in our theoretical analysis. We observe that increasing the phase lengths (and thus decreasing the number of communication rounds) incurs more regret in the initial time steps in both settings, which is to be expected.

Fig. 4

Regret over time for different choices of \(A_j\). In each case we consider \(\alpha = 1\), \(\Delta _{\min } = 0.1\) and the means of the remaining arms divide the interval [0.2, 0.8] uniformly

5 Discussion

In this paper, we presented an algorithm (Algorithm 1) for multi-agent bandits in a decentralised setting. Our algorithm builds upon the Gossip-Insert-Eliminate algorithm of [3] by making two modifications. First, we use tighter confidence intervals inspired by [6]. Second, we use a faster elimination scheme for reducing the number of arms that must be explored by an agent. Both modifications yield significant empirical improvement on simulated data (Fig. 2). Finally, we prove a regret bound (Theorem 3.1) which demonstrates asymptotically optimal performance of our algorithm, matching the asymptotic performance of a collection of agents with unlimited communication.

There is substantial scope for future work in this direction. One challenge of great practical importance is the development of distributed algorithms which are robust to both malicious agents and faulty communication [13]. An interesting theoretical challenge is to develop a multi-agent bandit algorithm which is both asymptotically optimal and nearly minimax optimal with limited communication. In very recent work of [1], an algorithm has been proposed which is minimax optimal in the distributed setting, and it would be interesting to synthesise this with the insights provided in the current paper.