Optimal Hyper-Scalable Load Balancing with a Strict Queue Limit

Load balancing plays a critical role in efficiently dispatching jobs in parallel-server systems such as cloud networks and data centers. A fundamental challenge in the design of load balancing algorithms is to achieve an optimal trade-off between delay performance and implementation overhead (e.g. communication or memory usage). This trade-off has primarily been studied so far from the angle of the amount of overhead required to achieve asymptotically optimal performance, particularly vanishing delay in large-scale systems. In contrast, in the present paper, we focus on an arbitrarily sparse communication budget, possibly well below the minimum requirement for vanishing delay, referred to as the hyper-scalable operating region. Furthermore, jobs may only be admitted when a specific limit on the queue position of the job can be guaranteed. The centerpiece of our analysis is a universal upper bound for the achievable throughput of any dispatcher-driven algorithm for a given communication budget and queue limit. We also propose a specific hyper-scalable scheme which can operate at any given message rate and enforce any given queue limit, while allowing the server states to be captured via a closed product-form network, in which servers act as customers traversing various nodes. The product-form distribution is leveraged to prove that the bound is tight and that the proposed hyper-scalable scheme is throughput-optimal in a many-server regime given the communication and queue limit constraints. Extensive simulation experiments are conducted to illustrate the results.


Introduction
Load balancing provides a crucial mechanism for efficiently distributing jobs among servers in parallel-processing systems. Traditionally, the primary objective in load balancing has been to optimize performance in terms of queue lengths or delays. Due to the immense size of cloud networks and data centers [13,20,28], however, implementation overhead (e.g. communication or memory usage involved in obtaining or storing state information) has emerged as a further key concern in the design of load balancing algorithms. Indeed, the fundamental challenge in load balancing is to achieve scalability: providing favorable delay performance, while only requiring low implementation overhead in large-scale deployments.
The seminal paper [12] approached the above challenge by imposing the natural performance criterion that the probability of non-zero delay vanishes as the number of servers grows large. It was shown that this can only be achieved with constant communication overhead per job when sufficient memory is available at the dispatcher. There are in fact schemes which achieve a vanishing delay probability with only one message per job [3,18,29] or even fewer [7], but these rely on server-initiated updates as opposed to dispatcher-driven probes. We defer a more extensive discussion of these papers and the broader literature to a later stage in this introduction.
In the present paper we pursue the same intrinsic trade-off between performance and communication overhead, but focus on the optimal performance for a potentially scarce communication budget, and our perspective is fundamentally different in two respects. First of all, we set the admissible message rate δ to be arbitrary, and in particular to be far lower than one message per job, which we refer to as the 'hyper-scalable' operating regime. This range is especially relevant in scenarios with relatively tiny jobs and a correspondingly massive arrival rate which may significantly exceed the message rate that can be sustained between the dispatcher and the servers, prohibiting even just one message per job. Second, jobs may only be admitted when a strict limit K on the queue position of the job can be guaranteed. This queue limit K can have any value and is offered in systems of any size, as opposed to a zero queue length that is only ensured with high probability in a many-server regime. The combination of a low communication budget per job and a strict admission condition is particularly pertinent for high-volume packet processing applications, where zero delay may not be feasible given the admissible message rate, but where an explicit queue limit is crucial.
As the cornerstone of our analysis, we establish a universal upper bound for the achievable throughput of any dispatcher-driven algorithm as a function of δ and K, thus capturing the trade-off between performance and communication overhead. We also introduce and analyze a specific hyper-scalable scheme which approaches this bound in a many-server regime, demonstrating that the bound is sharp.
Model set-up and hyper-scalable scheme. We adopt the set-up of the celebrated supermarket model which has emerged as the canonical framework in the related literature (as further reviewed below), but add several salient features relevant for our purposes. Specifically, we consider a system with N identical servers of unit exponential rate and a single dispatcher where jobs arrive as a Poisson process of rate Nλ. The dispatcher is unaware of the service requirements of jobs and cannot buffer them, but must immediately forward them to one of the servers or block them. The throughput of the system is defined as the rate of admitted jobs per server.
The blocking option is relevant since the dispatcher must enforce an explicit queue limit K, and is only allowed to admit a job and assign it to a server if it can guarantee that the queue position encountered by that job is at most K. Note that it is not enough for a job to end up in such a position thanks to a lucky guess, but that the dispatcher must have absolute certainty in advance that this is the case, and that a job must be discarded otherwise. Discarding may be the preferred option in packet processing applications when handling a packet beyond a certain tolerance window serves no useful purpose. In that case, processing an obsolete packet results in an unnecessary resource wastage and needlessly contributes to further congestion, and is thus worse than simply dropping the packet upfront.
As mentioned above, the dispatcher is oblivious of the service requirements, which are exponentially distributed and thus have unbounded support. Hence, the dispatcher critically relies on information provided by the servers in order to enforce the queue limit K, and is allowed to send probes for this purpose, requesting queue length reports at a rate N δ. In addition, the dispatcher is endowed with unlimited memory capacity, which it may use to determine which server to probe and when or to which server it will dispatch an arriving job. Servers return instantaneous queue length reports in response to probes from the dispatcher, but are not able to initiate messages or send unsolicited updates when reaching a certain status.
With the above framework in place, we will construct a specific hyper-scalable scheme which is guaranteed to enforce the queue limit K and operate within the communication budget δ. The scheme toggles each individual server between two modes of operation, labeled open and closed. An open period starts when the dispatcher requests a queue length update from the server and the reported queue length is below K; during that period the server is not working, and waits for incoming jobs from the dispatcher, seeing its queue only grow. Once the queue length reaches the limit K, a closed period starts, ending when the dispatcher requests the next update after τ time units; during that period the server is continuously working as long as jobs are available, without receiving any further jobs, thus draining its queue. When the queue length reported at an update is exactly K, the open period has length zero, and the next closed period starts immediately. By construction, the above-described mechanism maintains a queue limit of K at all times and induces a message rate of at most 1/τ per server, which makes τ = 1/δ the obvious choice.

Main contributions
The main contributions of the paper may be summarized as follows. First of all, we establish a universal upper bound λ * (δ, K) for the achievable throughput of any dispatcher-driven algorithm subject to the communication budget per server in terms of δ and the queue limit K. The upper bound relies on a simple yet powerful argument which counts the number of jobs that can be admitted per message given the queue limit K and the message rate δ. While the macroscopic view of the argument covers a broad range of strategies with possibly dynamic and highly complex update rules, the nature of the upper bound strongly points to the superior properties of constant update intervals.
Armed with that insight, we propose a hyper-scalable scheme which can operate at any given message rate δ and enforce any given queue limit K. At the same time, the scheme is specifically designed to produce system dynamics that can be represented in terms of a closed product-form queueing network, in which the servers act as customers traversing various nodes. This furnishes tractable expressions for the relevant stationary distributions and in particular the blocking probability. The expression for the blocking probability is used to prove that the achieved throughput approaches the minimum of the abovementioned upper bound and the normalized job arrival rate λ in a many-server regime. This in turn demonstrates that the upper bound is tight and that the proposed hyper-scalable scheme provides optimality in the three-way trade-off among queue limit, communication and throughput.
Background on load balancing algorithms. Load balancing algorithms can be broadly categorized as static (open-loop), dynamic (closed-loop), or some intermediate blend, depending on the amount of state information (e.g. queue lengths or load measurements) that is used in dispatching jobs among servers. Within the category of dynamic policies, one can further distinguish between dispatcher-driven (push-based) and server-oriented (pull-based) approaches. In the former case, the dispatcher 'pushes' jobs to the servers and takes the initiative to collect state information for that purpose, while the servers play a passive role and only provide state information when explicitly requested. In contrast, in server-oriented approaches, the servers may proactively share state information with the dispatcher, and indirectly 'pull' in jobs by advertising their availability or load status. The use of state information naturally allows dynamic policies to achieve better performance, but also involves higher implementation complexity (e.g. communication overhead and memory usage) as mentioned earlier. The latter issue has emerged as a pivotal concern due to the deployment of large-scale cloud networks and data centers with immense numbers of servers handling massive amounts of service requests.
The celebrated Join-the-Shortest-Queue (JSQ) policy provides the gold standard in the category of dispatcher-driven algorithms and offers strong stochastic optimality properties. Specifically, in case of identical servers, exponentially distributed service requirements and a service discipline at each server that is oblivious to the actual service requirements, the JSQ policy achieves minimum mean delay among all non-anticipating policies [10,31]. In order to implement the JSQ policy, however, a dispatcher relies on instantaneous knowledge of the queue lengths at all the servers, which may involve a prohibitive communication burden, and may not be scalable. Related is the join-below-threshold scheme [34], which is throughput-optimal, but the dispatcher-driven variant is not scalable either.
The latter issue has spurred a strong interest in so-called JSQ(d) strategies, where the dispatcher assigns incoming jobs to a server with the shortest queue among d servers selected uniformly at random. This involves d message exchanges per job (assuming d ≥ 2), and thus drastically reduces the communication overhead compared to the full JSQ policy when the number of servers N is large. At the same time, even a value as small as d = 2 yields significant performance improvements in the many-server regime N → ∞ compared to purely random assignment (d = 1) [22,30]. This is commonly referred to as the "power-of-two" effect. Similar power-of-d effects have been demonstrated for heterogeneous servers, non-exponential service requirements and loss systems in [8,9,25,26,27,32].
Unfortunately, JSQ(d) strategies lack the ability of the conventional JSQ policy to achieve zero queueing delay as N → ∞ for any finite value of d. In contrast, if d grew with N , making it possible to drive queueing delay to zero [24,17], the communication overhead would grow unboundedly. A noteworthy exception arises for batches of jobs when the value of d and the batch size grow suitably large, as can be deduced from results in [33]. Leaving batch arrivals aside though, it is in fact necessary for d to grow with N in order to achieve zero queueing delay, since results in the seminal paper [12] show that this is fundamentally impossible with a finite communication overhead per job, unless memory is available at the dispatcher to store state information.
The latter feature is exactly at the core of the so-called Join-the-Idle-Queue (JIQ) scheme [3,18], where servers advertise their availability by transferring a 'token' to the dispatcher whenever they become idle, thus generating at most one message per job. The dispatcher assigns incoming jobs to an idle server as long as tokens are outstanding, or to a uniformly at random selected server otherwise. Remarkably, the JIQ scheme has the ability of the full JSQ policy to drive the queueing delay to zero as N → ∞, even for generally distributed service requirements [11,29].
Note that for no single value of d can a JSQ(d) strategy rival the JIQ scheme, which simultaneously provides low communication overhead and asymptotically optimal performance. As alluded to above, this superiority reflects the power of server-oriented approaches in conjunction with memory at the dispatcher. The value of memory in load balancing was already studied in [1,23] in a 'balls-and-bins' context. Related work in [21] examines how much load balancing performance degrades when delayed information is used. A framework for mean-field analysis for JSQ(d) strategies with memory is developed in [19]. The authors of [2] use mean-field limits to determine the minimum required value of d for JSQ(d) strategies with memory to achieve zero queueing delay. The possibilities with limited memories are explored in [14].
Organization of the paper. The main results are presented in Section 2: the upper bound for the throughput and the analysis of the hyper-scalable scheme, using a closed queueing network. In Section 3 we provide simulation results to further illustrate the behavior of the hyper-scalable scheme. An extension of the hyper-scalable scheme that also aims to minimize queue lengths is introduced in Section 4. In Section 5 we establish product-form distributions for a general closed queueing network scenario which covers both the hyper-scalable scheme and the latter extension as special cases. We conclude with some remarks and suggestions for further research in Section 6.

Main results
In this section we discuss the main results, which can be summarized as follows. There is a function λ* of δ and K, such that subject to a message rate δ and queue limit K,
• the throughput of any dispatcher-driven algorithm is bounded from above by min{λ*(δ, K), λ},
• the throughput of our hyper-scalable scheme approaches min{λ*(δ, K), λ} as N → ∞.
These two results are covered in Subsections 2.1 and 2.2, respectively.

Universal upper bound
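We establish the upper bound for a slightly more general scenario with heterogeneous server speeds. Denote the speed of the n-th server by $\mu_n$ for $n = 1, \dots, N$. The next theorem shows that the achievable throughput of any dispatcher-driven algorithm subject to the message rate δ and queue limit K is bounded from above by
$$\lambda^*(\delta, K) = \delta\, M_K(\bar\mu/\delta), \tag{1}$$
with
$$M_K(\tau) = K - \sum_{k=0}^{K-1} (K-k)\, e^{-\tau}\, \frac{\tau^k}{k!}$$
and $\bar\mu = \frac{1}{N}\sum_{n=1}^{N} \mu_n$ denoting the system-wide average server speed. Note that $M_K(\tau)$ may be equivalently written as
$$M_K(\tau) = \sum_{k=0}^{\infty} \min\{k, K\}\, e^{-\tau}\, \frac{\tau^k}{k!}$$
and may be interpreted as the expected value of the minimum of K and a Poisson distributed random variable with mean τ.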
Theorem 1. The expected number of jobs that any dispatcher-driven algorithm can admit subject to the queue limit K during a period of length $T_0$ with at most $\delta N T_0$ message exchanges cannot exceed $2KN + \lambda^*(\delta, K) \times N T_0$, for any δ > 0.
In particular, the achievable throughput with a message rate of at most δ > 0 is bounded from above by $\lambda^*(\delta, K)$.
Recall that we defined throughput as the rate of admitted jobs per server, and note that the throughput is naturally bounded from above by the normalized arrival rate λ.
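As a numerical companion to the bound, the following minimal Python sketch evaluates $M_K(\tau)$ as the expected minimum of K and a Poisson variable with mean τ. It assumes that the bound takes the form $\lambda^*(\delta, K) = \delta M_K(\bar\mu/\delta)$, which is the form consistent with the expression $2\delta - 2\delta e^{-1/\delta} - e^{-1/\delta}$ quoted for K = 2 and unit server speed in the simulation section. The function names `poisson_pmf`, `M` and `lambda_star` are illustrative only.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """P(X = k) for X ~ Poisson(mean)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def M(K: int, tau: float) -> float:
    """E[min(K, X)] for X ~ Poisson(tau), computed as K - E[(K - X)^+]."""
    return K - sum((K - k) * poisson_pmf(k, tau) for k in range(K))


def lambda_star(delta: float, K: int, mu_bar: float = 1.0) -> float:
    """Assumed form of the throughput bound: delta * M_K(mu_bar / delta)."""
    return delta * M(K, mu_bar / delta)


# Tabulate the bound for K = 2 and a few message rates (unit server speed).
for delta in (0.1, 0.5, 1.0, 2.0):
    print(f"delta={delta:4.1f}  lambda*(delta, 2)={lambda_star(delta, 2):.4f}")
```

For small δ the printed values stay close to Kδ, and for large δ they approach the server speed 1, in line with the interpretation of the bound as "admitted jobs per message times messages per time unit".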
Proof. As noted earlier, since the execution times are exponentially distributed and thus have unbounded support, the dispatcher relies on information provided by the servers in order to enforce the queue limit K. Specifically, the dispatcher earns 'passes' for admitting k jobs when a server reports k = 0, . . . , K service completions since the previous update, and cannot admit any job without relinquishing a pass. Thus, the number of jobs that the dispatcher can admit during a particular time period cannot exceed the sum of (i) the maximum possible number of KN passes initially available; (ii) the maximum possible number of KN passes earned at the first update from each server during that period, if any; and (iii) the number of additional passes obtained at further updates over intervals that fell entirely during that period, if any. Now suppose that the dispatcher requests $L_n$ queue length reports from the n-th server during a period of length $T_0$, one after each of the update intervals of lengths $T_{n,1}, \dots, T_{n,L_n}$, with $\sum_{l=1}^{L_n} T_{n,l} \le T_0$ for all $n = 1, \dots, N$ and $L = \sum_{n=1}^{N} L_n \le \delta N T_0$. Then the number of passes earned at the l-th update equals the number of service completions during the interval of length $T_{n,l}$, which depends on the queue length at the start of that interval. However, this random variable is stochastically bounded from above by the number of service completions when the queue was full with K jobs at the start of the interval. In the latter case the number of passes earned is given by the minimum of K and a Poisson distributed random variable with parameter $\mu_n T_{n,l}$. We deduce that the expected total number of passes obtained at all these updates is bounded from above by
$$\sum_{n=1}^{N} \sum_{l=1}^{L_n} M_K(\mu_n T_{n,l}), \tag{2}$$
and to prove the first statement of the theorem it thus remains to be shown that this quantity is no larger than $\lambda^*(\delta, K) \times N T_0$. It is easily verified that
$$\frac{\partial^2 M_K(t)}{\partial t^2} = -e^{-t}\, \frac{t^{K-1}}{(K-1)!} < 0, \tag{3}$$
implying that $M_K(t)$ is concave as a function of t.
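As an aside, the above expression may be intuitively explained from the fact that the first derivative $\partial M_K(t)/\partial t$ equals the probability that at most K − 1 unit-rate Poisson events occur during a period of length t, while the (negative) derivative of the latter probability equals the probability that exactly K − 1 such events occur, by virtue of the Kolmogorov equations for a pure birth process. Because of concavity, we obtain that (2) is no larger than
$$L\, M_K\Big(\frac{1}{L} \sum_{n=1}^{N} \sum_{l=1}^{L_n} \mu_n T_{n,l}\Big). \tag{4}$$
Invoking the fact that $\partial M_K(t)/\partial t > 0$, i.e., $M_K(t)$ is increasing in t, together with $\sum_{n=1}^{N} \mu_n \sum_{l=1}^{L_n} T_{n,l} \le N \bar\mu T_0$ and $\lambda^*(x, K) = x\, M_K(\bar\mu/x)$, we may write
$$L\, M_K\Big(\frac{1}{L} \sum_{n=1}^{N} \sum_{l=1}^{L_n} \mu_n T_{n,l}\Big) \le L\, M_K\Big(\frac{N \bar\mu T_0}{L}\Big) = \lambda^*(\gamma, K) \times N T_0, \qquad \gamma = \frac{L}{N T_0} \le \delta.$$
Moreover, concavity of $M_K$ and $M_K(0) = 0$ yield
$$\frac{\partial \lambda^*(x, K)}{\partial x} = M_K(\bar\mu/x) - \frac{\bar\mu}{x}\, \frac{\partial M_K}{\partial t}(\bar\mu/x) > 0, \tag{5}$$
i.e., $\lambda^*(x, K)$ is increasing in x, and hence $\lambda^*(\gamma, K) \le \lambda^*(\delta, K)$, which completes the proof of the first statement of the theorem. Finally, to prove the second statement, we consider the long-term scenario $T_0 \to \infty$. The number of jobs that are admitted per time unit per server is then at most $(2KN + \lambda^*(\delta, K) \times N T_0)/(N T_0) \to \lambda^*(\delta, K)$, and the message rate per server equals at most $\delta N T_0/(N T_0) = \delta$.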
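The role of concavity in the proof can be illustrated numerically. The sketch below, in which all names and numbers are illustrative assumptions rather than part of the original analysis, compares the expected number of passes earned over four update intervals of equal length with that of four unequal intervals of the same total length, and checks by finite differences that the derivative of $M_K$ is the probability of at most K − 1 Poisson events.

```python
import math


def M(K: int, tau: float) -> float:
    """E[min(K, X)] for X ~ Poisson(tau)."""
    return K - sum(
        (K - k) * math.exp(-tau) * tau ** k / math.factorial(k) for k in range(K)
    )


def M_prime(K: int, tau: float) -> float:
    """P(X <= K - 1) for X ~ Poisson(tau), the claimed first derivative of M_K."""
    return sum(math.exp(-tau) * tau ** k / math.factorial(k) for k in range(K))


def expected_passes(intervals, K: int) -> float:
    """Upper bound on expected passes for one unit-speed server: sum of M_K(T_l)."""
    return sum(M(K, t) for t in intervals)


K = 3
equal = [2.0, 2.0, 2.0, 2.0]      # four updates, total length 8
uneven = [0.5, 0.5, 3.0, 4.0]     # four updates, same total length
print("equal intervals :", round(expected_passes(equal, K), 4))
print("uneven intervals:", round(expected_passes(uneven, K), 4))

# Central finite difference versus the closed-form derivative at t = 1.5.
h, t = 1e-5, 1.5
print("finite difference:", (M(K, t + h) - M(K, t - h)) / (2 * h))
print("P(X <= K - 1)    :", M_prime(K, t))
```

With the same number of messages and the same total time, the equal split earns more passes in expectation, which is the Jensen-type step behind the preference for constant update intervals.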
Properties of λ*. We now state some properties of λ*(δ, K) and discuss their consequences, where we assume without loss of generality that μ̄ = 1. In the next subsection we will introduce a hyper-scalable scheme which is able to achieve this throughput in the many-server regime. For now, we will reflect on the properties in light of the maximum throughput that is possible for any dispatcher-driven load balancing algorithm given a message rate δ.
Proposition 1. λ*(δ, K) has the following properties:
Proof. λ*(δ, K) is strictly increasing in δ because of (5) and is strictly increasing with y(δ) = 2 when δ ≥ 1 and y(δ) = K − 1 when δ < 1. All limiting statements are true for the LHS and RHS of the previous equation, therefore proving these properties for λ*(δ, K) too.
The properties in Proposition 1 are visualized in Figure 1. They can also be interpreted intuitively and practically. For Property (i), when the communication budget is expanded, i.e. δ is increased, more jobs can be dispatched to queues that are guaranteed to be short. Similarly, more jobs can be admitted into the system if the queue limit is raised, i.e., K is increased. Property (i), in conjunction with Theorem 1, implies that a throughput λ*(δ, K) cannot be achieved with a message rate strictly below δ, or a queue limit strictly below K.
Property (ii) shows that as the message rate grows large, full server utilization can be achieved. With an unlimited message rate, the dispatcher is able to find idle servers immediately, a necessary requirement for achieving full server utilization irrespective of the queue limit K.
Property (iii) shows that, first, when no communication is allowed, no jobs can be sent to queues that are guaranteed to be short. The further specification of the limit indicates that K jobs are admitted into the system per message. This in turn reveals that when the communication is extremely infrequent, all messages result in finding an idle server, and thus provide the dispatcher with K passes to admit jobs.
Finally, Property (iv) is somewhat similar to Property (iii). When the queue limit K increases, one needs fewer messages in order to achieve a server utilization level a. With a = 1, Property (iv) shows that one message per K jobs is needed in order to achieve full server utilization, a conclusion similar to the one drawn from Property (iii).
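The monotonicity and the two limiting regimes in δ discussed above can be checked numerically. The following sketch assumes unit average server speed and the form $\lambda^*(\delta, K) = \delta M_K(1/\delta)$, which matches the K = 2 expression used in the simulation section; the grid values are arbitrary illustrations.

```python
import math


def M(K: int, tau: float) -> float:
    """E[min(K, X)] for X ~ Poisson(tau)."""
    return K - sum(
        (K - k) * math.exp(-tau) * tau ** k / math.factorial(k) for k in range(K)
    )


def lambda_star(delta: float, K: int) -> float:
    """Assumed bound for unit server speed: delta * M_K(1 / delta)."""
    return delta * M(K, 1.0 / delta)


K = 3
# Scarce communication: each message yields about K admitted jobs.
print("lambda*/delta at delta=0.01:", lambda_star(0.01, K) / 0.01)
# Abundant communication: throughput approaches full utilization.
print("lambda* at delta=100       :", lambda_star(100.0, K))
# Monotonicity in delta and in K on a small grid.
for delta in (0.1, 0.5, 1.0, 2.0, 5.0):
    row = "  ".join(f"{lambda_star(delta, k):.4f}" for k in (1, 2, 3, 4))
    print(f"delta={delta:4.1f}  K=1..4: {row}")
```

Each printed row increases from left to right (larger K) and each column increases from top to bottom (larger δ), as described for Property (i).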

The hyper-scalable scheme
We now introduce the hyper-scalable scheme in full detail for the case of homogeneous servers.
At all times, the dispatcher remembers the most recent queue length that was reported by every server. Furthermore, the dispatcher records the number of jobs that have been sent to every server since the last update from that server. When the sum of these two numbers is strictly less than the queue limit K, a server is labeled open, and otherwise closed.
Whenever a job arrives to the dispatcher, it is assigned to an open server, if possible. There are two options for how to select an open server. Either an open server is selected uniformly at random (random case), or the open server that was interacted with (i.e. updated or received job) the longest ago is selected (FCFS case). The job is dropped when no open servers exist.
Exactly τ time units after a server was labeled closed, the dispatcher will request a queue length update of the server. The server becomes open when this queue length is strictly less than K, and the server remains closed for another τ time units if the queue length equals K, in which case the dispatcher will request the next queue length update after another τ time units. The hyper-scalable scheme is a dispatcher-driven algorithm, since only the dispatcher initiates messages and every server can track itself when it is labeled open by the dispatcher: exactly when the sum of the queue length during the latest update and the number of jobs received since then is strictly below K.
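The rules above can be turned into a small event-driven simulation. The sketch below is a minimal illustration of the random case under stated assumptions, not the simulator used for the experiments in this paper: all servers start empty and open, a closed server always starts its closed period with exactly K jobs (because open servers idle), so the number of completions during a closed period is sampled as the minimum of K and the number of unit-rate exponential completions fitting in τ. The names `simulate` and `sample_completions` are hypothetical.

```python
import heapq
import random


def sample_completions(K, tau, rng):
    """Jobs finished within tau by a unit-rate server that starts with K jobs."""
    done, t = 0, rng.expovariate(1.0)
    while t <= tau and done < K:
        done += 1
        t += rng.expovariate(1.0)
    return done


def simulate(N, lam, K, tau, horizon, seed=0):
    rng = random.Random(seed)
    state = [0] * N                 # last reported queue length + jobs sent since
    open_servers = list(range(N))   # servers with state < K
    updates = []                    # min-heap of (update epoch, server)
    admitted = arrivals = messages = 0
    t = rng.expovariate(N * lam)    # first arrival epoch
    while t < horizon:
        # Handle all updates that are due before this arrival.
        while updates and updates[0][0] <= t:
            when, s = heapq.heappop(updates)
            messages += 1
            state[s] = K - sample_completions(K, tau, rng)
            if state[s] < K:
                open_servers.append(s)                      # server re-opens
            else:
                heapq.heappush(updates, (when + tau, s))    # stays closed
        arrivals += 1
        if open_servers:
            i = rng.randrange(len(open_servers))            # uniform open server
            s = open_servers[i]
            state[s] += 1
            admitted += 1
            assert state[s] <= K                            # queue limit holds
            if state[s] == K:                               # server closes now
                open_servers[i] = open_servers[-1]
                open_servers.pop()
                heapq.heappush(updates, (t + tau, s))
        t += rng.expovariate(N * lam)
    return {
        "throughput": admitted / (N * horizon),
        "message_rate": messages / (N * horizon),
        "admitted": admitted,
        "arrivals": arrivals,
    }


print(simulate(N=100, lam=0.9, K=2, tau=2.0, horizon=500.0, seed=1))
```

Since two consecutive updates of the same server are at least τ apart, the measured message rate per server cannot exceed 1/τ in any run, whatever the seed.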
Note that by construction the hyper-scalable scheme respects the queue limit K at all times and involves a message rate of at most 1/τ. In addition, the scheme has been specifically designed to allow explicit analysis and derivation of provable capacity benchmarks. As it turns out, a crucial feature in that regard is for the servers to refrain from executing jobs while being marked open. This feature ensures that the queue length is exactly K at the moment a server becomes closed. The average number of job completions in an interval of length τ then equals $M_K(\tau)$, so one message leads to $M_K(\tau)$ admitted jobs on average, immediately yielding the following result. While the forced idling of servers during open periods may seem inefficient, the next theorem shows that the proposed hyper-scalable scheme is in fact throughput-optimal in large-scale systems, given the message rate δ and queue limit K, with the choice τ = 1/δ.
Theorem 2. For any δ > 0, the throughput achieved by the hyper-scalable scheme with τ = 1/δ approaches min{λ*(δ, K), λ} as N → ∞.
Since the hyper-scalable scheme obeys the queue limit K and involves a message rate of at most δ, Theorems 1 and 2 combined imply that it is throughput-optimal as N → ∞.
According to Theorem 1 and Property (i) of Proposition 1, one would require a message rate of at least δ to achieve a throughput of λ*(δ, K). Theorem 2 shows that the throughput of the hyper-scalable scheme approaches λ*(δ, K) as N → ∞ when λ ≥ λ*(δ, K). A combination of these two observations (and the fact that λ*(δ, K) is continuous in δ) indicates that the message rate of the hyper-scalable scheme must approach δ as N → ∞ when λ ≥ λ*(δ, K). This in turn implies that the expected duration of an open period must become negligible compared to the length τ of a closed period, i.e. the fraction of time that a server is marked open vanishes.
We now proceed with an outline of the proof of Theorem 2.
Analysis. For brevity, a server is said to be in state k when the sum of the queue length at its latest update epoch and the number of jobs the server has received since equals k. This means that all servers in state k < K are labeled open and servers in state K are labeled closed. In view of the homogeneity of the servers, it is useful to further introduce $N(t) = (N_0(t), \dots, N_K(t))$, where $N_k(t)$ stands for the number of servers in state k at time t. While the vector $N(t)$ provides a convenient representation, it is worth emphasizing that it does not provide a Markovian state description.
We now explain how individual servers transition between various states. When a job arrives to the system, the state of an open server will change from k < K to k + 1. An update of a server may cause the server to change state too. The new state of the server equals the number of jobs that are left in queue after the update interval of τ time units. The number of jobs that were served follows a truncated Poisson distribution, so the probability $p_k$ that exactly k jobs remain equals $p_k := e^{-\tau} \tau^{K-k}/(K-k)!$ for k > 0 and $p_0 := 1 - \sum_{k=1}^{K} p_k$. When k < K jobs are left, the state of the server becomes k. When there are K jobs left, the state of the server does not change and remains K.
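The update probabilities can be tabulated directly. The sketch below assumes that $p_0$ is the remaining probability mass (K or more potential completions within τ), and verifies that the mean number of jobs served per closed period, $\sum_k (K-k) p_k$, coincides with the expected minimum of K and a Poisson variable with mean τ. The function names are illustrative.

```python
import math


def update_probs(K: int, tau: float) -> list:
    """p[k] = probability that k jobs remain after a closed period of length tau."""
    p = [0.0] * (K + 1)
    for k in range(1, K + 1):
        # Exactly K - k completions of a unit-rate server in tau time units.
        p[k] = math.exp(-tau) * tau ** (K - k) / math.factorial(K - k)
    p[0] = 1.0 - sum(p[1:])  # assumption: K or more potential completions
    return p


def M(K: int, tau: float) -> float:
    """E[min(K, X)] for X ~ Poisson(tau)."""
    return K - sum(
        (K - k) * math.exp(-tau) * tau ** k / math.factorial(k) for k in range(K)
    )


K, tau = 2, 2.0
p = update_probs(K, tau)
mean_served = sum((K - k) * p[k] for k in range(K + 1))
print("p =", [round(x, 4) for x in p])
print("mean jobs served per closed period:", round(mean_served, 4))
print("M_K(tau)                          :", round(M(K, tau), 4))
```

The entry `p[K]` is the probability that a server reports a full queue and therefore stays closed for another τ time units.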
It is important to observe that service completions of jobs do not cause direct transitions in server states. The reason is twofold. When a server is open, it stops working on jobs, so there are no such completions at open servers. For closed servers, all servers are aggregated; the number of jobs in queue is not taken into account. Only after the period of length τ , the number of jobs left in queue is determined indirectly by using the transition probabilities as specified above.
Although the vector N(t) does not provide a Markovian state description as noted above, its evolution can be described in terms of a closed queueing network, in which the servers act as customers in the network, traversing various nodes corresponding to their states. Specifically, the closed queueing network consists of one multi-class "single-server" node with service rate λN in which the customers can be of classes 0, 1, . . . , K − 1, and one "infinite-server" node with deterministic service time τ that holds all class-K customers. A service completion at the single-server node makes one customer transition. The class of the customer changes from k to k + 1 if k < K − 1, or the customer transitions to the infinite-server node if its class was K − 1. When multiple customers are present at the single-server node, the customer that transitions is either selected uniformly at random (random case), or the customer that has been in the single-server node for the longest time is selected (FCFS case). Finally, upon a service completion at the infinite-server node a customer moves to the single-server node as class k < K with probability $p_k$, or directly re-enters the infinite-server node with probability $p_K$.
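A schematic representation is shown in Figure 2. We define $\gamma_k$ as the relative throughput value of class-k customers. With $\gamma_K = 1$, it follows that $\gamma_k = p_0 + \dots + p_k = 1 - \alpha_{K-1-k}(\tau)$ for k < K, where $\alpha_j(\tau) = \sum_{i=0}^{j} e^{-\tau} \tau^i/i!$ is the probability of at most j unit-rate Poisson events in a period of length τ.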
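The relative throughputs can be checked against the routing of the closed network described above. The sketch below, with illustrative names, builds $\gamma_k$ as the partial sums of the $p_k$ and verifies the traffic equations: class 0 is fed only by updates, class k by class k − 1 and by updates, and the infinite-server node by class K − 1 and by its own self-loop. It assumes $p_0$ is the remaining probability mass and reads $\alpha_j(\tau)$ as the Poisson distribution function.

```python
import math
from itertools import accumulate


def update_probs(K: int, tau: float) -> list:
    """p[k] = probability that k jobs remain after a closed period of length tau."""
    p = [0.0] * (K + 1)
    for k in range(1, K + 1):
        p[k] = math.exp(-tau) * tau ** (K - k) / math.factorial(K - k)
    p[0] = 1.0 - sum(p[1:])
    return p


def poisson_cdf(j: int, tau: float) -> float:
    """alpha_j(tau) = P(X <= j) for X ~ Poisson(tau) (assumed meaning of alpha)."""
    return sum(math.exp(-tau) * tau ** i / math.factorial(i) for i in range(j + 1))


def relative_throughputs(K: int, tau: float):
    """Return (p, gamma) with gamma[k] = p[0] + ... + p[k] for k < K, gamma[K] = 1."""
    p = update_probs(K, tau)
    gamma = list(accumulate(p[:K])) + [1.0]
    return p, gamma


K, tau = 3, 1.5
p, gamma = relative_throughputs(K, tau)
print("gamma =", [round(g, 4) for g in gamma])
# Traffic equations of the closed network.
print("class 0      :", gamma[0], "=", p[0] * gamma[K])
for k in range(1, K):
    print(f"class {k}      :", gamma[k], "=", gamma[k - 1] + p[k] * gamma[K])
print("infinite node:", gamma[K], "=", gamma[K - 1] + p[K] * gamma[K])
```

Both sides of each printed equation agree up to rounding, which confirms that the partial sums solve the traffic equations under the normalization $\gamma_K = 1$.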
By virtue of the above-described equivalence, the process N (t) representing the server states under the hyper-scalable scheme inherits the product-form equilibrium distribution of the closed network as stated in the next proposition.
Proposition 2. The equilibrium distribution of the system with N servers is
$$\pi(n_0, n_1, \dots, n_{K-1}, n_K) = G^{-1}\, (n_0 + \dots + n_{K-1})!\, \prod_{k=0}^{K-1} \frac{1}{n_k!} \Big(\frac{\gamma_k}{\lambda N}\Big)^{n_k}\, \frac{\tau^{n_K}}{n_K!} \tag{6}$$
if $n_0 + \dots + n_K = N$, with normalization constant
$$G = \sum_{n_0 + \dots + n_K = N} (n_0 + \dots + n_{K-1})!\, \prod_{k=0}^{K-1} \frac{1}{n_k!} \Big(\frac{\gamma_k}{\lambda N}\Big)^{n_k}\, \frac{\tau^{n_K}}{n_K!}.$$
A proof of Proposition 2 can be found in Section 5, and the product-form equilibrium distribution may be informally understood as follows. The infinite-server node allows a product-form distribution even for deterministic service times. While traditionally exponentially distributed service times are considered, the equilibrium distribution is insensitive to the service time distribution at the infinite-server node and only depends on its mean, see Section 5 for details. As mentioned above, the service discipline at the single-server node with exponentially distributed service times may either be FCFS or random order of service. In the case of the FCFS discipline, although it is not reversible [15], the single-server node with multiple classes can be represented as an order-independent queue [4,16]. According to Theorem 2.2 in [16], the queue is quasi-reversible, which is sufficient for a product-form distribution. For random order of service, which is a symmetric service discipline, the single-server node is reversible, yielding a product form as well.
The equilibrium distribution (6) can be simplified when only the number of open and closed servers matters. This immediately yields an expression for the blocking probability L N as provided in the next corollary.
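The many-server behavior of the blocking probability can be explored numerically. The sketch below is an assumption-laden illustration, not the corollary itself: it takes the standard product form for a closed network with one exponential single-server node of rate λN and one infinite-server node of mean τ, aggregates over the open classes using $\sum_{k<K} \gamma_k = M_K(\tau)$, and equates the blocking probability with the stationary probability that all N servers are closed. Under these assumptions the reciprocal of the blocking probability is $\sum_{m=0}^{N} r^m \prod_{j<m} (1 - j/N)$ with $r = M_K(\tau)/(\lambda\tau)$. The function names are hypothetical.

```python
import math


def M(K: int, tau: float) -> float:
    """E[min(K, X)] for X ~ Poisson(tau)."""
    return K - sum(
        (K - k) * math.exp(-tau) * tau ** k / math.factorial(k) for k in range(K)
    )


def blocking_probability(N: int, lam: float, K: int, tau: float) -> float:
    """P(all N servers closed) under the assumed aggregated product form."""
    r = M(K, tau) / (lam * tau)          # ratio of the bound to the arrival rate
    total = term = 1.0                   # m = 0 term (no open servers)
    for m in range(1, N + 1):
        term *= r * (1.0 - (m - 1) / N)  # weight of states with m open servers
        total += term
    return 1.0 / total


K, tau = 2, 2.0                          # message rate delta = 1 / tau = 0.5
bound = M(K, tau) / tau                  # lambda*(0.5, 2) for unit server speed
print("bound:", round(bound, 4))
for lam in (0.5, 0.9):
    for N in (10, 100, 1000):
        thr = lam * (1.0 - blocking_probability(N, lam, K, tau))
        print(f"lambda={lam}  N={N:5d}  throughput={thr:.4f}  "
              f"target={min(bound, lam):.4f}")
```

For λ below the bound the blocking probability vanishes as N grows, while for λ above the bound the throughput settles at the bound, which matches the statement of Theorem 2.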
Theorem 2 allows us to equivalently view λ * (δ, K) as the throughput that is achieved by the hyper-scalable scheme as N → ∞ when λ ≥ λ * (δ, K). We now revisit properties (ii) and (iii) as stated in Proposition 1 from that perspective. In the limiting scenario δ → ∞, τ ↓ 0, servers are updated after an infinitesimally small time, which in turn alerts the dispatcher immediately when even a single job has been served. This ensures that all servers can work at full capacity.
In the scenario δ ↓ 0, τ → ∞, update periods become extremely long. Every update that does happen, will definitely find an idle server and allow for K admitted jobs, explaining why λ * (δ, K) ≈ Kδ for small δ.
Remark 1. Note that with the queue limit K in force we may assume each server to have a finite buffer of size K. In case of a finite buffer, the queue limit K would automatically be enforced, even if the dispatcher were allowed to forward jobs without any advance guarantee. With the option of "(semi)-blind guesses", the throughput bound would trivially become 1 (the average server speed), and Property (iii) indicates that the achievable throughput λ * (δ, K) without lucky guesses could be (substantially) lower when δ is (significantly) smaller than 1/K. However the throughput of 1 can only be approached for a high arrival rate, at the expense of severe blocking, whereas the hyper-scalable scheme can deliver the throughput λ * (δ, K) with negligible blocking asymptotically.

Simulation experiments and optimality benchmarks
In this section we conduct various simulation experiments to further benchmark the properties of the hyper-scalable scheme and make several comparisons. Throughout we set the queue limit K = 2, yielding the throughput bound $\lambda^*(\delta, 2) = 2\delta - 2\delta e^{-1/\delta} - e^{-1/\delta}$ as a function of the message rate δ. Furthermore, all simulation results emulate the random case, i.e. a job is sent to an open server selected uniformly at random.
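As a quick numerical companion to the bound above, the following minimal sketch evaluates λ*(δ, 2) and can be used to check the two limiting regimes discussed earlier (λ* ≈ Kδ for small δ and λ* → 1 for large δ). The function names are illustrative. The general-K helper rests on an assumption that is not stated in this section: that the bound equals δ times the expected number of jobs that a unit-rate exponential server holding K jobs completes within τ = 1/δ time units. That assumption reproduces the K = 2 expression.

```python
import math


def throughput_bound_k2(delta: float) -> float:
    """Closed-form bound for K = 2 as stated in the text."""
    decay = math.exp(-1.0 / delta)
    return 2.0 * delta - 2.0 * delta * decay - decay


def throughput_bound(delta: float, queue_limit: int) -> float:
    """ASSUMED generalization: delta * E[min(K, X)] with X ~ Poisson(1/delta)."""
    tau = 1.0 / delta
    pmf = math.exp(-tau)   # P(X = 0)
    cdf = 0.0
    expected_min = 0.0
    for j in range(queue_limit):
        cdf += pmf                   # P(X <= j)
        expected_min += 1.0 - cdf    # adds P(X >= j + 1)
        pmf *= tau / (j + 1)         # P(X = j + 1)
    return delta * expected_min


if __name__ == "__main__":
    for delta in (0.01, 0.5, 1.0, 100.0):
        print(delta, throughput_bound_k2(delta), throughput_bound(delta, 2))
```

For K = 2 the two functions agree up to rounding, which serves as a consistency check on the assumed general form rather than as a proof of it.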

Baseline version of the hyper-scalable scheme
First, we evaluate the hyper-scalable scheme itself in Figures 3a and 3b for K = 2 and K = 3 respectively. We note that the message rate stays below the line y = 1/τ , confirming that it never exceeds 1/τ . The throughput and blocking probability achieved by the hyper-scalable scheme are nearly indistinguishable from the respective asymptotic values (upper and lower bounds, respectively), especially at lower and medium values of the communication budget 1/τ . For higher values of the communication budget, the throughput and blocking probability slightly diverge from the asymptotic values but remain remarkably close nevertheless. This demonstrates that the asymptotic optimality properties of the hyper-scalable scheme as stated in Theorems 1 and 2 already manifest themselves in moderately large systems.
In order to provide further insight into the asymptotic optimality, we compare the baseline version of the hyper-scalable scheme with several variants and alternative scenarios that are not analytically tractable.
Specifically, in the next two subsections, we examine the following variants through simulations:
• "non-idling": open servers continue working, but convey their queue length as if they had not been working while being open,
• "work-conserving": open servers continue working and convey their actual queue lengths at update epochs.
At first sight, one might suspect that these variants achieve a larger throughput. As we will see, however, the differences are small and are only observable at low load or in systems with few servers.
In Subsection 3.4 we make a comparison with the AUJSQ det (δ) scheme considered in [5], which is not analytically tractable either but seems to be asymptotically throughput-optimal as well.

Non-idling variant
The equilibrium distribution of the server states as provided in Proposition 2 applies to the non-idling variant as well, and the throughput and the number of messages exchanged per admitted job are identical in both scenarios. The only difference arises in the expected queue lengths encountered by admitted jobs: they are somewhat smaller in the non-idling scenario, as illustrated by the simulation results presented in Figure 4a.
At low load values, there are instants where there is time for servers to execute jobs when they are open. This causes a distinction between the two variants, since in the non-idling variant jobs join shorter queues. In Section 4, we will consider a tractable extension of the hyper-scalable scheme that aims to reduce the queue lengths. As the number of servers grows, however, an overflow of arrivals will cause open servers to have less time to execute jobs, which causes the queue lengths to be similar in both scenarios. This viewpoint provides further intuition why the hyper-scalable scheme is still asymptotically optimal.

Work-conserving variant
We now turn to a work-conserving variant of the hyper-scalable scheme, in which open servers also work on jobs, and in fact convey their actual queue length at an update epoch. In this case the evolution of the server states is different, and the equilibrium distribution provided in Proposition 2 no longer applies. The throughput and blocking probability are nevertheless similar in both scenarios. This may be intuitively explained as follows. When λ ≥ λ * (1/2, 2), Theorem 2 shows that there are hardly ever any open servers, and hence there should not be any substantial difference between the two variants, which is corroborated by Figure 4b.
When λ < λ * (1/2, 2), there can be a significant number of open servers. Theorem 2 however implies that the hyper-scalable scheme approaches zero blocking and throughput λ in this case. While it is plausible that the work-conserving variant achieves that as well, as attested by Figure 4b, it is simply not feasible to achieve lower blocking or higher throughput. The only room for improvement is thus in the number of message exchanges per admitted job, and Figure 4b demonstrates that the work-conserving variant indeed provides some gain compared to the hyper-scalable scheme in that regard. To put that observation in perspective, recall from Corollary 1 that the communication overhead is strictly decreasing in τ . For such a low arrival rate, the hyper-scalable scheme permits a much larger update interval τ . Figure 5 confirms that the choice τ = 5 largely eliminates the difference in communication overhead between the work-conserving variant and the baseline version.

Comparison with the AUJSQ det (δ) scheme
We now compare the hyper-scalable scheme with the AUJSQ det (δ) scheme [5], which is somewhat similar, except that every server is updated exactly every τ = 1/δ time units based on a timer. Thus the AUJSQ det (δ) scheme might update servers even when they are known to have strictly fewer than K = 2 jobs in queue. There are further minor differences: in AUJSQ det (δ) jobs are assigned to the server with the lowest state (thus giving preference to servers that are more likely to be empty) and open servers do work on jobs. In contrast to [5], we consider a variant of the AUJSQ det (δ) scheme in which jobs are blocked when the dispatcher is not aware of any servers that are guaranteed to have strictly fewer than K = 2 jobs in queue. The comparison is shown in Figure 6.
It is important to note that in the hyper-scalable scheme the expected number of messages per admitted job is independent of λ, while in the AUJSQ det (δ) scheme the expected number of messages per time unit is independent of λ. We observe that the average number of messages per admitted job coincides when λ > λ * (1/τ, K). While it is natural to expect that the AUJSQ det (δ) scheme offers similar asymptotic optimality properties, it lacks the mathematical tractability of the hyper-scalable scheme to facilitate a rigorous proof argument.

Non-exponential service times
We conclude our simulation experiments by analyzing the hyper-scalable scheme for non-exponential service time distributions. In Figure 7a, the service times are Gamma(2,2) distributed. The throughput of the hyper-scalable algorithm slightly exceeds λ * (1/τ, K), the maximum throughput when job sizes are exponential. The number of messages per admitted job is also lower than $1/M_K(\tau)$. Both effects are explained by the fact that the tail of the Gamma(2,2) distribution is lighter than the tail of the exponential distribution. This means that more jobs are completed in a fixed time interval, which increases the effectiveness of the messages sent. The service time distribution in Figure 7b is Gamma(1/2,1/2). The opposite effect is observed: the throughput is lower compared to Figure 3a and the message rate is larger, because of the heavier tail.
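To make the tail comparison concrete, the following small sketch computes the mean and the squared coefficient of variation (variance divided by squared mean) of the two Gamma distributions. It assumes, as an interpretation not spelled out in the text, that Gamma(a, b) denotes shape a and rate b, so that both distributions have unit mean like the exponential baseline. The helper name is illustrative.

```python
def gamma_mean_and_scv(shape, rate):
    """Mean and squared coefficient of variation of a Gamma(shape, rate) law.

    ASSUMPTION: the second parameter is a rate, not a scale.
    """
    mean = shape / rate
    variance = shape / rate ** 2
    return mean, variance / mean ** 2


if __name__ == "__main__":
    for shape, rate in ((2.0, 2.0), (1.0, 1.0), (0.5, 0.5)):
        print((shape, rate), gamma_mean_and_scv(shape, rate))
```

Under this reading the squared coefficient of variation equals 1/a, so it lies below the exponential value of one for Gamma(2,2) and above it for Gamma(1/2,1/2).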

Extension aimed at minimizing queue lengths
While the hyper-scalable scheme is asymptotically throughput-optimal given the message rate δ and queue limit K, it does not make any explicit effort beyond that to minimize queue lengths or delays experienced by jobs. Motivated by that observation, we now consider an extension of the hyper-scalable scheme aimed at minimizing waiting times. In this extension, a server that receives its i-th job after an update at which its queue length was k becomes closed for $\tau_{k,k+i}$ time units. After this time, it becomes open if k + i < K and it is updated if k + i = K. Thus, servers are not only closed when they become full, but are closed for a while after every job they receive.

Henceforth, we focus on the case K = 2 for ease of exposition, and we set $\tau_{0,0} = 0$, $\tau_{0,1} = \tau_{1,1} = \tau_1$, $\tau_{0,2} = \tau_{1,2} = \tau_2$ and $\tau_{2,2} = \tau_3$. We can set $\tau_{0,0}$ to zero without loss of generality, as it makes no sense to have a cool-down period for an empty server. As a consequence there is no difference between servers that had zero jobs or one job at the previous update epoch, so we can set $\tau_{0,1} = \tau_{1,1}$ and $\tau_{0,2} = \tau_{1,2}$ as well.

Let $p_{2j}$ be the probability that j jobs remain after an update, when there were zero or one jobs just after the latest update epoch. This means that the server had $\tau_{0,1}$ time units to work on the first job and another $\tau_{0,2}$ time units after both jobs were dispatched to it. This gives $p_{20} = e^{-\tau_1}(1 - \tau_2 e^{-\tau_2} - e^{-\tau_2}) + (1 - e^{-\tau_1})(1 - e^{-\tau_2})$, $p_{22} = e^{-\tau_1} e^{-\tau_2}$ and $p_{21} = 1 - p_{20} - p_{22}$. Let $q_{2j}$ be the probability that j jobs remain after an update, when there were two jobs just after the latest update epoch. This gives $q_{20} = 1 - e^{-\tau_3} - \tau_3 e^{-\tau_3}$, $q_{22} = e^{-\tau_3}$ and $q_{21} = 1 - q_{20} - q_{22}$.
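The expression for the probability of an empty server after an update can be unpacked by conditioning on whether the first job finishes during its closed period of length τ1. The worked step below is a sketch under two assumptions: service times are exponential with unit rate (so the number of potential completions in a busy interval of length t is Poisson with mean t), and servers do not work while they are open, as in the baseline scheme.

```latex
\begin{align*}
p_{20} &= \underbrace{e^{-\tau_1}}_{\text{first job unfinished}}
          \underbrace{\bigl(1 - e^{-\tau_2} - \tau_2 e^{-\tau_2}\bigr)}_{\text{at least two completions in } \tau_2}
        + \underbrace{\bigl(1 - e^{-\tau_1}\bigr)}_{\text{first job finished}}
          \underbrace{\bigl(1 - e^{-\tau_2}\bigr)}_{\text{at least one completion in } \tau_2}, \\
p_{22} &= e^{-\tau_1} e^{-\tau_2}, \qquad p_{21} = 1 - p_{20} - p_{22}.
\end{align*}
```

The same reasoning, applied to a server that holds two jobs throughout a single closed period of length τ3, yields the corresponding probabilities after an update of a full server.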
Servers can be in one of the following five states.
$A_1$ The server is idle and open.
$B_1$ The server had zero jobs during the previous update moment and received one job since, or the server had one job during the previous update moment and received no jobs since. The server is now marked closed for $\tau_1$ time units.
$A_2$ The server had zero jobs during the previous update moment and received one job since, or the server had one job during the previous update moment and received no jobs since. The server was marked closed for $\tau_1$ time units but is now open.
$B_2$ The server had zero jobs during the previous update moment and received two jobs since, or the server had one job during the previous update moment and received one job since. The server is now marked closed for $\tau_2$ time units.
$B_3$ The server had two jobs during the previous update moment and is now marked closed for $\tau_3$ time units.
The transitions are schematically represented in Figure 8, with the transition probabilities as defined earlier.
The system dynamics under this extension of the hyper-scalable scheme can also be represented in terms of a closed queueing network with one single-server node that holds two classes of customers and three infinite-server nodes. The states $A_1$ and $A_2$ correspond to the two classes that customers can be of when they are present at the single-server node. The states $B_1$, $B_2$ and $B_3$ each correspond to one of the three infinite-server nodes in the network, with deterministic service times $\tau_1$, $\tau_2$ and $\tau_3$, respectively.

Proposition 3. The equilibrium distribution of the system with N servers is π(n 1 , n 2 , m 1 , m 2 , m 3 ) if n 1 + n 2 + m 1 + m 2 + m 3 = N , with (γ 1 , γ 2 , κ 1 , κ 2 , κ 3 ) = (p 20 + p22q20

In particular, because of the PASTA property, the blocking probability is given by , and $\pi(0, N) \to \max\{0, 1 - \lambda^*(\tau_1, \tau_2, \tau_3)/\lambda\}$ as N → ∞, which equals zero when $\lambda \le \lambda^*(\tau_1, \tau_2, \tau_3) := \frac{\gamma_1 + \gamma_2}{\kappa_1\tau_1 + \kappa_2\tau_2 + \kappa_3\tau_3}$.
• The average queue position of an admitted job equals

The last two statements follow directly from the relative throughput values. These exact expressions for the maximum throughput λ * , the average number of updates per admitted job u and the average queue position q of admitted jobs allow us to evaluate the performance of this extension.
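The maximum throughput of the extension can be evaluated numerically from the update probabilities and the relative throughput values. The sketch below is not taken from the text: the routing between states is an assumption inferred from the state descriptions (A1 → B1 → A2 → B2, after which an update leads to A1, B1 or B3 depending on the number of remaining jobs, and likewise after B3), and the relative throughputs are obtained by solving the resulting traffic equations with κ2 normalized to one. Function names are illustrative.

```python
import math


def extension_max_throughput(tau1: float, tau2: float, tau3: float) -> float:
    """lambda*(tau1, tau2, tau3) for K = 2, with tau2 > 0 and tau3 > 0."""
    # Update outcome probabilities as given in the text.
    p22 = math.exp(-tau1) * math.exp(-tau2)
    p20 = (math.exp(-tau1) * (1 - tau2 * math.exp(-tau2) - math.exp(-tau2))
           + (1 - math.exp(-tau1)) * (1 - math.exp(-tau2)))
    q22 = math.exp(-tau3)
    q20 = 1 - math.exp(-tau3) - tau3 * math.exp(-tau3)

    # ASSUMED routing, inferred from the state descriptions:
    # A1 -> B1 -> A2 -> B2; B2 -> A1/B1/B3 w.p. p20/p21/p22;
    # B3 -> A1/B1/B3 w.p. q20/q21/q22. Normalizing kappa2 = 1 gives:
    gamma2 = kappa1 = kappa2 = 1.0
    kappa3 = p22 / (1 - q22)
    gamma1 = p20 + p22 * q20 / (1 - q22)

    return (gamma1 + gamma2) / (kappa1 * tau1 + kappa2 * tau2 + kappa3 * tau3)


def baseline_bound_k2(delta: float) -> float:
    """Baseline bound lambda*(delta, 2) from the simulation section."""
    decay = math.exp(-1.0 / delta)
    return 2 * delta - 2 * delta * decay - decay


if __name__ == "__main__":
    tau = 2.0
    print(extension_max_throughput(0.0, tau, tau), baseline_bound_k2(1 / tau))
    print(extension_max_throughput(0.5, tau, tau))
```

With τ1 = 0 and τ2 = τ3 = τ the extension has no cool-down after the first job, and under the assumed routing the value coincides with the baseline bound at δ = 1/τ, which is a useful sanity check on the inferred traffic equations.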
In Figure 9a, the value of τ 1 is varied while the values of τ 2 and τ 3 are kept constant. Since τ 1 represents the time that a server is closed when it has one job in queue, the result is that the second job that is sent to the server experiences a shorter queue in expectation. Indeed, for larger values of τ 1 , the mean experienced queue length q decreases. As a further benefit, the mean number of updates decreases as well, since an idle server will take at least τ 1 +τ 2 time units to be updated. The penalty incurred for these advantages is that the maximum throughput, λ * , drops below the value of λ * (δ, K) as asymptotically achieved by the baseline version of the hyper-scalable scheme, since servers may become idle during the τ 1 time in which they will not receive any more jobs.
Finally, in Figure 9b we show that a trade-off between the parameters is possible: τ 1 is increased while τ 2 is decreased, which leads to interesting behavior. Around the point τ 1 = 0, the values of λ * and u do not change when the parameters are altered, but the value of q does change. Such a trade-off might be worthwhile in scenarios where mean queue lengths play an important role.

Closed queueing network and further proofs
In this section we establish product-form distributions for a general closed queueing network scenario which captures the network representations of the hyper-scalable scheme and the extension considered in the previous section as special cases. This provides the proofs of Propositions 2 and 3.
The closed queueing network consists of N customers circulating among one single-server node and B infinite-server nodes. Customers can be of A classes while at the single-server node, denoted by $A_1, \ldots, A_A$. Denote the infinite-server nodes by $B_1, \ldots, B_B$. The routing probabilities are denoted by $p_{x \to y}$; this is the probability that a customer transitions from x to y (x and y may correspond to either a class or an infinite-server node).
Service completions at the multi-class single-server node occur at an exponential rate λN. The customer that completes service is selected either uniformly at random or in a FCFS manner, in which case it is the customer that has been at the single-server node the longest. If the selected customer is of class i, then it immediately returns to the single-server node as a class-j customer with probability $p_{A_i \to A_j}$ or it moves to node $B_j$ with probability $p_{A_i \to B_j}$. The service times at the infinite-server node $B_i$ are deterministic and equal to $\tau_i$. Upon completing service at node $B_i$, a customer moves to the single-server node as a class-j customer with probability $p_{B_i \to A_j}$, or to node $B_j$ with probability $p_{B_i \to B_j}$.
The relative throughput values may be calculated from the traffic equations, where γ i stands for the relative throughput of class A i at the single-server node and κ i for the relative throughput at node B i . We assume a "single-chain network", where the routing probability matrix is irreducible, meaning that all customers can reach all classes and nodes.
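For an irreducible routing matrix, the traffic equations determine the relative throughputs up to a multiplicative constant. The following sketch solves them with plain Gauss-Jordan elimination after pinning one reference value to one. The function name, the ordering of states and the three-state toy routing matrix are hypothetical and serve only as an illustration.

```python
def relative_throughputs(routing, ref=0):
    """Solve theta_j = sum_i theta_i * routing[i][j] with theta[ref] = 1.

    `routing` is a row-stochastic, irreducible matrix given as a list of rows.
    """
    n = len(routing)
    # Row j encodes sum_i theta_i * P[i][j] - theta_j = 0.
    a = [[routing[i][j] - (1.0 if i == j else 0.0) for i in range(n)]
         for j in range(n)]
    b = [0.0] * n
    # The equations are linearly dependent, so one is replaced by theta[ref] = 1.
    a[ref] = [1.0 if i == ref else 0.0 for i in range(n)]
    b[ref] = 1.0
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        scale = a[col][col]
        a[col] = [v / scale for v in a[col]]
        b[col] /= scale
        for r in range(n):
            if r != col and a[r][col] != 0.0:
                factor = a[r][col]
                a[r] = [vr - factor * vc for vr, vc in zip(a[r], a[col])]
                b[r] -= factor * b[col]
    return b


if __name__ == "__main__":
    # Hypothetical network: one class A1 and two infinite-server nodes B1, B2.
    # State order: [A1, B1, B2].
    toy_routing = [
        [0.0, 1.0, 0.0],   # A1 -> B1
        [0.3, 0.0, 0.7],   # B1 -> A1 or B2
        [1.0, 0.0, 0.0],   # B2 -> A1
    ]
    gamma1, kappa1, kappa2 = relative_throughputs(toy_routing)
    print(gamma1, kappa1, kappa2)
```

Elimination is used here instead of repeated multiplication by the routing matrix because the latter need not converge when the routing is periodic, as is the case for purely cyclic routes.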

Proposition 4.
(a) The equilibrium distribution of the system with N customers is π(n 1 , n 2 , . . . , n A , m 1 , m 2 , . . . , m B ) if n 1 + . . . + n A + m 1 + . . . + m B = N , with normalization constant where n i is the number of customers of class A i at the single-server node and m j the number of customers at infinite-server node B j .
(b) The equilibrium probability of there being n customers at the single-server node and N − n customers in total at all the infinite-server nodes equals

In particular, with $R = \frac{\gamma_1 + \ldots + \gamma_A}{\kappa_1\tau_1 + \ldots + \kappa_B\tau_B}$ and $x = \lambda/R$, because of the PASTA property, the probability that no customer resides at the single-server node is and $\pi(0, N) \to \max\{0, 1 - R/\lambda\}$ as N → ∞, which equals zero when λ ≤ R.
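The limit in part (b) can be explored numerically. The displayed formula for π(0, N) is not available here, so the sketch below assumes the standard product form for a closed network with one exponential single-server node of rate λN and infinite-server nodes, under which π(0, N) = 1 / Σ_{n=0}^{N} [N!/(N − n)!] (xN)^{−n} with x = λ/R. This expression is an assumption made for illustration, chosen to be consistent with the stated limit max{0, 1 − R/λ}; the function name is hypothetical.

```python
import math


def empty_single_server_prob(num_customers: int, x: float) -> float:
    """ASSUMED formula: 1 / sum_{n=0}^{N} N!/(N-n)! * (x*N)^(-n), x = lambda/R."""
    # Accumulate the logarithm of each term to avoid overflow when x < 1.
    log_terms = [0.0]
    log_term = 0.0
    for n in range(1, num_customers + 1):
        log_term += math.log((num_customers - n + 1) / (x * num_customers))
        log_terms.append(log_term)
    largest = max(log_terms)
    scaled_sum = sum(math.exp(t - largest) for t in log_terms)
    # The n = 0 term equals 1, so pi(0, N) = exp(-largest) / scaled_sum.
    return math.exp(-largest) / scaled_sum


if __name__ == "__main__":
    for x in (0.5, 1.0, 2.0, 4.0):
        print(x, empty_single_server_prob(2000, x), max(0.0, 1.0 - 1.0 / x))
```

Printing the finite-N value next to max{0, 1 − 1/x} shows how closely a system of a given size follows the limiting expression, including the regime λ ≤ R where the limit is zero.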
In order to prove Proposition 4, we will verify that the equilibrium distribution (10) satisfies the balance equations of the closed queueing network.

Proof of Proposition 4
In order to verify the balance equations, we may assume that the service times of the infinite-server nodes are exponentially distributed even though in our closed queueing network, the service times are deterministic. This is because the equilibrium distribution (10) is insensitive to the service time distributions at the infinite-server nodes and depends only on their means (see Chapter 3 of [16] for a further discussion on this).
To see this, consider one infinite-server node D with exponential service rate µ D and throughput value κ D . This node adds the term to the product-form equilibrium distribution, representing the presence of d customers in the infinite-server node. We now replace this infinite-server node by a series of infinite-server nodes, denoted by D 1 , . . . , D M , each with an exponential service rate M µ D . The transition probabilities are altered in such a way that every transition previously to node D now goes to node D 1 instead. Customers then transition from node D i to D i+1 for i = 1, . . . , M − 1 with probability one. Finally, any transition previously from node D now departs from node D M . This construction makes every customer stay in this collection of nodes for M exponentially distributed phases, the total duration of which is an Erlang(M, M µ) distributed random variable. All other throughput values in the network remain equal.
The throughput values of all these nodes will be equal to κ D (since they are in series). Finally, similarly to the simplification of (10) to (11), all nodes D 1 , . . . , D M may be aggregated, which would lead to a term in the equilibrium probability, representing the presence of d customers in total in the infinite-server nodes D 1 , . . . , D M . Note that the term in the RHS exactly matches the term (12), that appears when the node D has an exponentially distributed service time. This shows that the equilibrium distribution does not change when an exponential node is replaced by an Erlang(M, M µ) node, for any integer M , which can also be verified by substitution in the balance equations. Of course, each infinite-server node B i with µ i = 1/τ i can be replaced by such an Erlang distribution using this construction. Because an Erlang(M, M µ) random variable converges to a deterministic quantity 1/µ as M tends to infinity, this indicates that the equilibrium distribution also holds with infinite-server nodes that have deterministic service times. In fact, the node D may be replaced by any phase-type distribution, and every distribution may be approximated arbitrarily closely by phase-type distributions, implying that the equilibrium distribution in (10) in fact holds for generally distributed service times with mean τ i at the infinite-server node B i as well, although that is not directly relevant for our purposes.
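The aggregation step can be made explicit. The displayed terms are not available here, so the following worked identity assumes the usual infinite-server factor $(\kappa_D/\mu_D)^d/d!$ for the term (12). With that assumption, summing over all ways to split d customers among the M phases and applying the multinomial theorem gives:

```latex
\[
\sum_{d_1 + \cdots + d_M = d} \; \prod_{i=1}^{M} \frac{1}{d_i!}\left(\frac{\kappa_D}{M\mu_D}\right)^{d_i}
= \frac{1}{d!}\left(M \cdot \frac{\kappa_D}{M\mu_D}\right)^{d}
= \frac{1}{d!}\left(\frac{\kappa_D}{\mu_D}\right)^{d}.
\]
```

Since an Erlang(M, Mµ) random variable has mean 1/µ and variance 1/(Mµ²), the variance vanishes as M grows while the mean stays fixed, which is consistent with the deterministic limit used in the argument.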
We will now verify that (10) indeed solves the balance equations for the random order of service case, and we will use µ i = 1/τ i , representing the rates of the infinite-server nodes. The proof for the FCFS case is quite similar, but involves a more detailed state representation, and is deferred to Appendix A.
Proof of part (a) -ROS. For conciseness, denote by (a, b) the vector (a 1 ,. . . , a A , b 1 , . . . , b B ) and by e i the i-th unit vector.
Note that (10) is a proper distribution by definition. Since the equilibrium distribution is unique, it suffices to verify that (10) satisfies the following set of balance equations: The first line of the RHS refers to transitions where a customer at the single-server node transitions to the same node and may change class. The second line refers to transitions from the single-server node to one of the infinite-server nodes. Lines three and four correspond to transitions from an infinite-server node to the single-server node or to another infinite-server node, respectively. We will show that (10) satisfies the balance equations. By using the definition of (10) in the RHS, we obtain Next, we combine the inner sums, resulting in

Conclusion
We established a universal upper bound for the achievable throughput of any dispatcher-driven algorithm for a given communication budget and queue limit. We also introduced a specific hyper-scalable scheme which can operate at any given message rate and enforce any given queue limit, while allowing the system dynamics to be captured via a closed product-form network. We leveraged the product-form distribution to show that the bound is tight, and that the proposed hyper-scalable scheme provides asymptotic optimality in the three-way tradeoff among performance, communication and throughput. Extensive simulation experiments were presented to illustrate the results and make comparisons with various alternative design options. The work-conserving variant covered in Subsection 3.3 is especially worth discussing further. Intuitively, letting servers work all the time seems better than pausing the servers when they become open, but this remains to be rigorously proven.
The extension aimed at minimizing waiting times that was introduced in Section 4 warrants further attention as well. For the baseline scenario, we were able to prove a strict relationship between the amount of communication and the throughput. Likewise, there might exist a result, similar in spirit to Theorem 1, which provides an upper bound for the throughput and the average queue position of admitted jobs, given a certain communication budget. The main point of concern in this regard is that the concavity argument no longer seems to hold.
Finally, it would be worth investigating whether the current framework could be broadened further. It may be possible for example to extend the category of algorithms considered, specifically allowing for pull-based schemes. While the results in [7] imply that Theorem 1 does not hold for pull-based schemes, there might be a larger upper bound covering such algorithms as well. For further extensions, other performance metrics might be considered too, such as the mean waiting time as opposed to the throughput subject to a queue limit.
customers are at the single-server node, and the order of the classes of customers is saved as well: the k-th customer at the single-server node has class $c_k$. We will sometimes refer to the number of customers of a specific class with $a_i = \sum_j 1\{c_j = A_i\}$. Furthermore, $b_i$ customers are at the infinite-server node $B_i$.
Equilibrium distribution for the extended state space. We will show that the equilibrium distribution (modulo normalization constant) of state (c, b) equals with a k the number of customers of class A k . We assume FCFS arrivals of customers at the single-server node: customers arrive at the end of the line at the single-server node and only the customer first in line is able to transition.
Balance equations. First, we introduce the balance equations, in which the symbol m is used to denote the length of vector c, The term before π(c, b) on the LHS represents the outgoing rate of state (c, b), which equals a rate of λN for the single-server node (if at least one customer is present there) plus a rate of b j µ j , for each infinite-server node B j .
On the RHS, four possible transitions to state (c, b) are shown, each preceded by the rate of the transition. First, a transition from the non-empty single-server node makes the then first customer change its class from c m−1 to class c m . If the previous class order at the single-server node is c m − 1, c 1 , . . . , c m−1 , then a transition to that node will make the class order exactly c. Second, if the previous class order at the single-server node is K − 1, c 1 , . . . , c m , then a transition from that node to an infinite-server node will make the class order exactly c. Additionally, if the number of customers at infinite-server node B j was b j − 1, then it will become b j as the infinite-server node receives an extra customer. Third, any of the customers at the infinite-server nodes might transition to the single-server node. Finally, customers might transition from one of the infinite-server nodes to another.
We will show that π satisfies the balance equations. By using the definition of π in the RHS, we obtain Next, we reorganize terms, yielding which shows that π is the equilibrium distribution of the extended state space. Finally, note that in the original state space, only the number of customers of certain classes is tracked. Thus, π(a, b) is an enumeration of π(c, b) over all possible orders with the correct number of customers of a certain class. The number of possible orders is $\binom{a_1 + \ldots + a_A}{a_1, \ldots, a_A}$, which leads to $\pi(a, b) = \binom{a_1 + \ldots + a_A}{a_1, \ldots, a_A} \pi(c, b)$; this is the description of π as presented in the statement of the proposition.
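The counting step at the end of the argument can be checked by brute force for small class counts. The sketch below compares the multinomial coefficient with a direct enumeration of distinct class orders; the function names are illustrative.

```python
from itertools import permutations
from math import factorial


def multinomial(counts):
    """(a_1 + ... + a_A)! / (a_1! ... a_A!) for a tuple of class counts."""
    result = factorial(sum(counts))
    for count in counts:
        result //= factorial(count)
    return result


def count_orders_brute_force(counts):
    """Number of distinct class orders c with counts[i] customers of class i."""
    labels = [cls for cls, count in enumerate(counts) for _ in range(count)]
    return len(set(permutations(labels)))


if __name__ == "__main__":
    for counts in ((2, 1, 1), (2, 2), (3, 0, 1)):
        print(counts, multinomial(counts), count_orders_brute_force(counts))
```

The enumeration grows factorially and is only meant for small examples; the closed-form coefficient is what enters the aggregated distribution.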