Analysis of cycle stealing with switching times and thresholds

doi:10.1016/j.peva.2004.09.003

Performance Evaluation

Volume 61, Issue 4, August 2005, Pages 347-369

https://doi.org/10.1016/j.peva.2004.09.003

Abstract

We consider two processors, each serving its own M/GI/1 queue, where one of the processors (the “donor”) can help the other processor (the “beneficiary”) with its jobs during times when the donor processor is idle. That is, the beneficiary processor “steals idle cycles” from the donor processor. There is a switching time required for the donor processor to start working on the beneficiary jobs, as well as a switching back time. We also allow for threshold constraints on both the beneficiary and donor sides, whereby the decision to help is based not only on idleness but also on satisfying threshold criteria in the number of jobs.

We analyze the mean response time for the donor and beneficiary processors. Our analysis is approximate, but can be made as accurate as desired, and is validated via simulation. Results of the analysis illuminate principles on the general benefits of cycle stealing and the design of cycle stealing policies.

Introduction

Since the invention of networks of workstations, systems designers have touted the benefits of allowing a user to take advantage of machines other than her own, at times when those machines are idle. This notion is often referred to as cycle stealing. Cycle stealing allows such a user, Betty, with multiple jobs, to offload one of her jobs to the machine of a different user, Dan, if Dan’s machine is idle, giving Betty two machines to process her jobs. When Dan’s workload resumes, Betty must return to using only her own machine. We refer to Betty as the beneficiary, to her machine as the beneficiary machine/server, and to her jobs as beneficiary jobs. Likewise, we refer to Dan as the donor, to his machine as the donor machine/server, and to his jobs as donor jobs.

Although cycle stealing provides obvious benefits to the beneficiary, these benefits come at some cost to the donor. For example, the beneficiary’s job may have to be checkpointed and the donor’s working set memory reloaded before the donor can resume, delaying the resumption of processing of donor jobs. In our model we refer to these additional costs associated with cycle stealing as switching times.

A primary goal of this paper is to understand the benefit of cycle stealing for the beneficiary and the penalty to the donor, as a function of switching times. A secondary goal is to derive parameter settings for cycle stealing. In particular, given non-zero switching times, cycle stealing may pay only if the beneficiary’s queue is “sufficiently” long. We seek to understand the optimal threshold on the beneficiary queue when switching to help, and the optimal threshold on the donor queue when switching back. More broadly, we seek general insights into which system parameters have the most significant impact on the effectiveness of cycle stealing.

We assume there are two queues, the beneficiary queue and the donor queue, with independent arrival processes and service time distributions operating as M/GI/1/FCFS queues. Jobs arrive at rate $λ_{B}$ (respectively, $λ_{D}$ ) at the beneficiary (respectively, donor) queue and have service requirement $X_{B}$ with distribution $G_{B}$ (respectively, $X_{D}$ with distribution $G_{D}$ ). The load made up by beneficiary (respectively, donor) jobs is denoted by $ρ_{B}$ (respectively, $ρ_{D}$ ) where $ρ_{B} = λ_{B} E [X_{B}]$ and $ρ_{D} = λ_{D} E [X_{D}]$ . If the donor server is idle, and if the number of jobs at the beneficiary queue is at least $N_{B}^{th}$ , the donor transitions into the switching state, for a random amount of time, $K_{sw}$ . After $K_{sw}$ time, the donor server is available to work on the beneficiary queue and the beneficiary queue becomes an M/ $G_{B}$ /2 queue. When the number of donor jobs in queue reaches $N_{D}^{th}$ (either during $K_{sw}$ , or during the time the donor is helping the beneficiary), the donor transitions into a switching back state, for a random amount of time, $K_{ba}$ . After the completion of the switch back, the donor server resumes working on its own jobs until the donor queue is empty. The donor server cannot work on any job while the donor is in the switching or switching back state.

A few details are in order. First, in the above model, the donor processor continues to cooperate with the beneficiary even if there is no beneficiary work left for it to do—the donor processor can switch back only when a donor job arrives.¹ Second, we assume that if the donor processor is working on a beneficiary job and a donor job arrives, that beneficiary job is returned to the beneficiary queue and will be resumed by the beneficiary processor as soon as that processor is available. The work done on the job is not lost (i.e. preemptive resume).² Our performance metric throughout is mean response time (overall and for each class of jobs), where the response time of a job is the time from when the job arrives until it completes service. We assume the first three moments of the service times are finite, and queues are stable.
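The donor's switching policy described above can be sketched as a small state machine. This is only an illustrative sketch: the state names, the function `next_donor_state`, and the `timer_done` flag (marking the end of the current $K_{sw}$ or $K_{ba}$ period) are hypothetical names introduced here, not notation from the paper.

```python
from enum import Enum


class DonorState(Enum):
    OWN = "own"              # serving (or waiting for) donor jobs
    SWITCHING = "switching"  # K_sw in progress, no work is done
    HELPING = "helping"      # available to the beneficiary queue
    SWITCHING_BACK = "back"  # K_ba in progress, no work is done


def next_donor_state(state, n_donor, n_benef, n_b_th, n_d_th, timer_done):
    """Return the donor server's next state (hypothetical helper).

    n_donor / n_benef: current numbers of donor / beneficiary jobs.
    n_b_th / n_d_th:   thresholds N_B^th and N_D^th.
    timer_done:        True once the current K_sw or K_ba period has elapsed.
    """
    if state is DonorState.OWN:
        # Start switching only when idle and the beneficiary queue is long enough.
        if n_donor == 0 and n_benef >= n_b_th:
            return DonorState.SWITCHING
        return state
    if state is DonorState.SWITCHING:
        # Reaching N_D^th during K_sw sends the donor straight into switching back.
        if n_donor >= n_d_th:
            return DonorState.SWITCHING_BACK
        return DonorState.HELPING if timer_done else state
    if state is DonorState.HELPING:
        # The donor keeps cooperating, even with no beneficiary work left,
        # until the number of donor jobs reaches N_D^th.
        if n_donor >= n_d_th:
            return DonorState.SWITCHING_BACK
        return state
    # SWITCHING_BACK: resume own jobs once K_ba has elapsed.
    return DonorState.OWN if timer_done else state
```

For example, with $N_{B}^{th} = 2$ and $N_{D}^{th} = 1$, an idle donor facing two beneficiary jobs moves from `OWN` to `SWITCHING`, and a single donor arrival during either `SWITCHING` or `HELPING` moves it to `SWITCHING_BACK`.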

Consider the simplest instance of our problem—where the service requirements of all jobs are drawn from exponential distributions and the switching times and thresholds are zero. Even for this simplest instance the continuous time Markov chain, while easy to describe, appears computationally intractable. This is due to the fact that the stochastic process having state (number of beneficiary jobs, number of donor jobs) grows infinitely in two dimensions and contains no structure that can be easily exploited in practice to obtain an exact solution. While solution by truncation of the Markov chain is possible, the errors that are introduced by ignoring portions of the state space (infinite in two dimensions) can be significant, especially at higher traffic intensities. Thus truncation is neither sufficiently accurate nor robust for our purposes.

To our knowledge, there has been no previous analytical work on cycle stealing with switching times and thresholds. Cycle stealing without switching times and thresholds has been analyzed, under exactly our model, by Foley and McDonald [5]; they prove asymptotic, heavy-traffic bounds on the performance of cycle stealing under exponential job size distributions.

A model related to our cycle stealing model is the “coupled processor model,” which we elaborate on below. In this model two processors each serve their own class of jobs, and if either is idle it may help the other, increasing the rate of the other processor. This help incurs no switching time and has a benefit even if only a single job is present (i.e. two processors can work on the single job). However, because the processors work in concert, rather than on different jobs, these systems will gain no multi-server benefit when serving highly variable jobs; short jobs may get stuck waiting behind long jobs in the single queue for each class. All works we mention below consider Poisson arrivals.

Early work on the coupled processor model includes papers by Fayolle and Iasnogorodski [4] and Konheim et al. [8]. Both papers assume exponential service times, deriving expressions for the stationary distribution of the number of jobs of each type. Fayolle and Iasnogorodski use complex algebra, eventually solving either a Dirichlet boundary value problem or a Riemann-Hilbert boundary value problem, depending on the accelerated rates of the servers. Konheim et al. assume that the accelerated rate is twice the original rate, which yields simpler expressions (still in the form of complex integrals). While it is possible to numerically evaluate these analytical expressions, they were not evaluated in either work; thus no intuition was provided on the performance of these systems.

The above work is extended by Cohen and Boxma [3] to the case of general service times. They consider stationary workload, which they formulate as a Wiener-Hopf boundary value problem. This leads to expressions involving either integrals or infinite sums; if the queues are symmetric, simpler expressions for mean total workload are found, but not for mean response time. They again have the two processors working in concert, without a switching time, and provide analytical expressions rather than numerical values.

In more recent work, Borst et al. [2] apply a transform method to the expressions in [3], yielding asymptotic relations between the workloads and the service requirement distributions. This leads to the insight that if a processor has a load less than one, it is “insulated” from the heavy-tail of the other, as long periods without cooperation will not lead to large backlogs. This is not the case if the load is greater than one, as the queue now must rely on help to be stable. Borst et al. [1] consider a similar question under a related model of generalized processor sharing, where n classes of jobs can share a processor with arbitrary weights. Using probabilistic bounds, they show that different service rates can either insulate the performance of different classes from the others or not, again depending on whether the non-cooperative load is larger than one. Both of these papers are concerned with the asymptotic behavior of workload, whereas our work isolates mean response time. Our work is thus complementary to these results.

This paper presents the first analysis of cycle stealing under general service requirements with switching times and thresholds. Recall that the difficulty in analyzing cycle stealing is that the corresponding stochastic process has state space that grows infinitely in two dimensions (2D), making it computationally intractable. The key idea in our approach is to find a way to transform this 2D-infinite Markov chain into a Markov chain that is infinite in only one dimension (1D-infinite Markov chain), which can be analyzed efficiently. The questions in applying such a transformation are (i) what should the 1D-infinite Markov chain track, and (ii) how can all the relevant information from the 2D chain be captured in the 1D-infinite Markov chain. Our 1D-infinite Markov chain tracks the number of beneficiary jobs, yielding measures of beneficiary performance. For the donor jobs, our state space contains only limited knowledge, $0, 1, \dots, N_{D}^{th}$ , or $\geq$ $N_{D}^{th}$ jobs. Nevertheless we are able to capture all necessary information by using special transitions in our Markov chain, where these transitions represent the lengths of an assortment of busy periods. We refer to this technique as dimensionality reduction. The difficulty in dimensionality reduction lies in specifying the right busy periods, some of which transcend the definition of the analytical model.

Once the 1D-infinite Markov chain is specified, the hard work is finished, since the limiting probabilities in this chain can be solved efficiently using known numerical (matrix analytic) techniques, which in turn gives mean response time for beneficiary jobs. The mean response time of donor jobs is analyzed as an M/GI/1 queue with generalized vacations, and all the necessary information is provided by the limiting probabilities of the 1D-infinite Markov chain. While a closed-form solution may be preferable, our chain is compact enough, and matrix analytic methods powerful enough, that only a couple of seconds are required to generate most of the results plots in this paper. Furthermore, our method generalizes to more complex models, e.g. multiple donors (Section 7).

Our analysis is approximate but can be made as accurate as desired. The primary approximation lies in representing the length of the busy periods by phase type (PH) distributions. The beneficiary service requirement ( $X_{B}$ ) and the switching time to help ( $K_{sw}$ ) must also be approximated by PH distributions, although $X_{D}$ and $K_{ba}$ can be general. In this paper, we use a PH distribution, shown in Fig. 1, to capture the first three moments of the busy periods, and verify that this is sufficient via simulation (see Section 5).³ For $X_{B}$ and $K_{sw}$ , we assume throughout the paper that they have PH distributions, and hence no approximation is involved.
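To illustrate the kind of moments that such a PH distribution would be fitted to, the sketch below computes the first three moments of an ordinary M/GI/1 busy period from the standard formulas. This is an assumption-laden illustration: the paper's analysis uses an assortment of busy periods, of which this is only the simplest, and the function name `busy_period_moments` is hypothetical.

```python
def busy_period_moments(lam, m1, m2, m3):
    """First three moments of a standard M/GI/1 busy period.

    lam:        Poisson arrival rate
    m1, m2, m3: E[X], E[X^2], E[X^3] of the service requirement X
    """
    rho = lam * m1
    if not 0 <= rho < 1:
        raise ValueError("requires load rho = lam * E[X] in [0, 1)")
    b1 = m1 / (1 - rho)
    b2 = m2 / (1 - rho) ** 3
    b3 = m3 / (1 - rho) ** 4 + 3 * lam * m2 ** 2 / (1 - rho) ** 5
    return b1, b2, b3


# Exponential service with mean 1 (moments 1, 2, 6) and arrival rate 0.5.
print(busy_period_moments(0.5, 1.0, 2.0, 6.0))  # (2.0, 16.0, 288.0)
```

A three-moment fit would then choose PH parameters so that the PH distribution reproduces the returned triple.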

Our analysis yields many interesting results concerning cycle stealing, detailed in Sections 4 and 6. While cycle stealing obviously benefits the beneficiaries (beneficiary jobs) and hurts the donors (donor jobs), we find that when $ρ_{B} > 1$, cycle stealing is profitable overall even under significant switching times, as it may ensure stability of the beneficiary queue. For $ρ_{B} < 1$, we define load-regions under which cycle stealing pays. We find that in general the switching time is only prohibitive when it is large compared with $E [X_{D}]$. Under zero switching time, cycle stealing always pays.

Two counterintuitive results are that when $ρ_{B} < 1$ , the mean response time of the beneficiaries is surprisingly insensitive to the switching time, and also insensitive to the variability of the donor job size distribution. Even when the variability of the donor job sizes is very high, and donor help thus is very bursty, the beneficiaries still enjoy significant benefits.

Our analysis also allows us to investigate characteristics of the beneficiary and donor side thresholds, $N_{B}^{th}$ and $N_{D}^{th}$ , both with respect to their impact on stability and their impact on mean response time. With respect to beneficiary stability, we find that $N_{B}^{th}$ has no effect, while increasing $N_{D}^{th}$ increases the stability region. Donor stability is not affected by either threshold. With respect to overall mean response time, we find that mean response time is far more sensitive to changes in $N_{D}^{th}$ than to changes in $N_{B}^{th}$ . This extends the results of [12], where we study only the effect that $N_{B}^{th}$ has on mean response time. We find the optimal value of $N_{B}^{th}$ tends to be well above 1. The reason is that increasing $N_{B}^{th}$ does not appreciably diminish beneficiary gain, but it does alleviate donor pain. We find that the optimal setting of $N_{B}^{th}$ is an increasing function of $ρ_{B}$ , $ρ_{D}$ , and switching times. By contrast, we find that the optimal value of $N_{D}^{th}$ is often close to 1, provided $ρ_{B} < 1$ . Increasing $N_{D}^{th}$ significantly hurts the donor, although it may provide significant help to the beneficiary if $ρ_{B}$ is high. We find that the optimal $N_{D}^{th}$ is not a monotonic function of $ρ_{D}$ , but is an increasing function of $ρ_{B}$ and switching times.

Our model deals with switching times in a general way, making the results applicable both to a shared-memory multiprocessor architecture and to a network of workstations (NOW). Our switching times, $K_{sw}$ and $K_{ba}$, can be viewed as the time for switching between job types in a shared-memory multiprocessor architecture. In a NOW, there is an additional overhead incurred from migrating jobs from one server to another: jobs that have not started running require negligible overhead, whereas migrating a “running” job (in progress) requires high overhead, since all of its state must be migrated with it. Our model captures this notion in that the switching back time, $K_{ba}$, can be viewed as the time incurred for preempting an already running job and returning it to the beneficiary.

Analysis of beneficiary mean response time

In this section, we analyze the mean response time of beneficiary jobs. The key idea in the analysis is to reduce a 2D-infinite Markov chain to a 1D-infinite Markov chain. In Section 2.1 we illustrate this dimensionality reduction technique via the analysis of the simplest case with zero switching times (i.e. $K_{sw} = K_{ba} = 0$ ) and threshold values fixed at 1 (i.e. $N_{B}^{th} = N_{D}^{th} = 1$ ). In Section 2.2 we extend the analysis to the general case of nonzero switching times and arbitrary threshold values.

Analysis of donor mean response time

In this section, we analyze the mean response time of donor jobs. The donor queue is analyzed as an M/GI/1 queue with generalized vacations [6], where we use the limiting probabilities that we compute in Section 2. The following theorem gives a way to calculate the mean response time of donor jobs:

Theorem 1

The mean response time of donor jobs, $E [T_{D}]$ , is given by $E [T_{D}] = E [X_{D}] + \frac{λ_{D} E [X_{D}^{2}]}{2 (1 - ρ_{D})} + \frac{(N_{D}^{th} (N_{D}^{th} - 1) + 2 N_{D}^{th} λ_{D} E [K_{ba}] + λ_{D}^{2} E [K_{ba}^{2}]) p}{2 λ_{D} ((N_{D}^{th} + λ_{D} E [K_{ba}]) p + (1 - p))},$ where $p = \frac{Pr (Region C_{N_{D}^{th} - 1}) + Pr (Region S_{N_{D}^{th} - 1})}{Pr (}$
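As a minimal sketch, the expression for $E [T_{D}]$ in Theorem 1 can be evaluated numerically as follows. The definition of $p$ is cut off in this excerpt, so the sketch simply takes $p$ as an input probability; the function and parameter names are hypothetical.

```python
def donor_mean_response_time(lam_d, ex, ex2, n_d_th, ek_ba, ek_ba2, p):
    """Evaluate E[T_D] from the formula in Theorem 1.

    lam_d:         donor arrival rate (lambda_D)
    ex, ex2:       E[X_D] and E[X_D^2]
    n_d_th:        donor-side threshold N_D^th
    ek_ba, ek_ba2: E[K_ba] and E[K_ba^2]
    p:             the probability p of Theorem 1 (supplied by the caller,
                   since its definition is truncated in this excerpt)
    """
    rho_d = lam_d * ex
    if not 0 < rho_d < 1:
        raise ValueError("requires 0 < rho_D < 1")
    if not 0 <= p <= 1:
        raise ValueError("p must be a probability")
    # First two terms: the usual M/GI/1 mean response time.
    base = ex + lam_d * ex2 / (2 * (1 - rho_d))
    # Third term: extra delay caused by the threshold and the switch back.
    numer = (n_d_th * (n_d_th - 1)
             + 2 * n_d_th * lam_d * ek_ba
             + lam_d ** 2 * ek_ba2) * p
    denom = 2 * lam_d * ((n_d_th + lam_d * ek_ba) * p + (1 - p))
    return base + numer / denom
```

One sanity check on the formula: with $N_{D}^{th} = 1$ and $K_{ba} = 0$ the third term vanishes for every $p$, so the donor sees the response time of an ordinary M/GI/1 queue.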

Stability

In this section, we derive stability conditions for cycle stealing with switching times and thresholds. We find for example that the stability condition for the donor jobs remains $ρ_{D} < 1$ , regardless of whether the donor jobs experience switching times. By contrast, the stability region for the beneficiary jobs can be significantly below $2 - ρ_{D}$ , specifically because the switching time erodes the beneficiary stability region. Also, interestingly, the value of $N_{B}^{th}$ does not affect the stability region

Validation of analysis

Since our analysis involves approximation of busy periods by PH distributions, it is of paramount importance to validate the analysis. Analytical validation against limiting load cases is presented in Section 5.1, and simulation validation over a range of load is reported in Section 5.2.

Results of analysis

This section discusses our results. Throughout we will use the term “gain” to denote the improvement (drop) in mean response time experienced by beneficiary jobs under cycle stealing, as compared with dedicated servers, and the term “pain” to refer to the increase in mean response time experienced by donor jobs under cycle stealing as compared with dedicated servers: $\text{gain} = \frac{E [T_{B}]^{\text{Dedicated}}}{E [T_{B}]^{\text{CS}}} \quad \text{and} \quad \text{pain} = \frac{E [T_{D}]^{\text{CS}}}{E [T_{D}]^{\text{Dedicated}}},$ where $E [T_{B}]^{\text{Dedicated}}$ refers to the mean response time of beneficiaries
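The two ratios can be computed directly from mean response times. The sketch below is illustrative only: the function name is hypothetical and the numbers in the usage line are made up, not results from the paper.

```python
def gain_and_pain(t_b_dedicated, t_b_cs, t_d_dedicated, t_d_cs):
    """Return (gain, pain) as defined above.

    t_b_dedicated, t_b_cs: beneficiary mean response time with dedicated
                           servers and under cycle stealing (CS)
    t_d_dedicated, t_d_cs: the same two quantities for donor jobs
    """
    gain = t_b_dedicated / t_b_cs   # > 1 when beneficiaries improve
    pain = t_d_cs / t_d_dedicated   # > 1 when donors are slowed down
    return gain, pain


# Made-up values: beneficiary time halves, donor time grows by 25%.
print(gain_and_pain(4.0, 2.0, 2.0, 2.5))  # (2.0, 1.25)
```

Both ratios equal 1 when cycle stealing changes nothing, which makes them convenient to compare across load settings.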

Extensions and current work

This paper analyzes the mean response time under cycle stealing with switching times and thresholds, presenting many insights into the characteristics and performance of cycle stealing. Our analytical approach can be applied to other variants of cycle stealing models [11]. For example, we do not need to require that the donor switches back immediately when $N_{D}^{th}$ is reached; we can allow the donor to first complete the beneficiary job in progress. Completing the beneficiary job obviates the need

Acknowledgements

This work was supported by NSF Career Grant CCR-0133077, by NSF ITR Grant 99-167 ANI-0081396, and by IBM via PDG Grant 2003.

Takayuki Osogami is a Ph.D. candidate at the Department of Computer Science, Carnegie Mellon University, under the direction of Mor Harchol-Balter. He received a B.Eng. degree in Electronic Engineering from the University of Tokyo, Japan, in 1998. In 1998-2001, he was at IBM Tokyo Research Laboratory, where the principal project was development of optimization algorithms. His current research interests include performance modeling and analysis of scheduling and resource allocation policies

References (13)

[1] S. Borst et al., Reduced-load equivalence and induced burstiness in GPS queues with long-tailed traffic flows, Queueing Syst. (2003).
[2] S. Borst et al., The asymptotic workload behavior of two coupled queues, Queueing Syst. (2003).
[3] J. Cohen et al., Boundary Value Problems in Queueing System Analysis (1983).
[4] G. Fayolle et al., Two coupled processors: the reduction to a Riemann-Hilbert problem, Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete (1979).
[5] R.D. Foley et al., Exact asymptotics of a queueing network with a cross-trained server.
[6] S. Fuhrmann et al., Stochastic decompositions in the M/G/1 queue with generalized vacations, Operations Res. (1985).

There are more references available in the full text version of this article.

Mor Harchol-Balter is an Associate Professor of Computer Science at Carnegie Mellon University. She received her doctorate from the Computer Science department at the University of California at Berkeley under the direction of Manuel Blum. She is a recipient of the McCandless Chair, the NSF CAREER award, the NSF Postdoctoral Fellowship in the Mathematical Sciences, multiple best paper awards, and several teaching awards, including the Herbert A. Simon Award for Teaching Excellence. Professor Harchol-Balter is heavily involved in the ACM SIGMETRICS research community. Her work focuses on designing new scheduling/resource allocation policies for various distributed computer systems including Web servers, distributed supercomputing servers, networks of workstations, and database systems. Her work spans both analysis and implementation and emphasizes integrating measured workload distributions into the problem solution.

Alan Scheller-Wolf is an Associate Professor of Manufacturing and Operations Management at the Tepper School of Business, Carnegie Mellon University. He received his doctorate from the Industrial Engineering and Operations Research Department of Columbia University, under the supervision of Professor Karl Sigman. Professor Scheller-Wolf works on problems in production and inventory theory, computer science, stochastic processes, and queueing theory. He is on the editorial board of IIE Transactions, Management Science, M&SOM, Operations Research, and Operations Research Letters. He has also completed consulting projects with The American Red Cross, Caterpillar, John Deere, and Intel.
