Elsevier

Parallel Computing

Volume 29, Issue 9, September 2003, Pages 1121-1152
Scheduling divisible workloads on heterogeneous platforms

https://doi.org/10.1016/S0167-8191(03)00095-4

Abstract

In this paper, we discuss several algorithms for scheduling divisible workloads on heterogeneous systems. Our main contributions are (i) new optimality results for single-round algorithms and (ii) the design of an asymptotically optimal multi-round algorithm. This multi-round algorithm automatically performs resource selection, a difficult task that was previously left to the user. Because it is periodic, it is simpler to implement and more robust to changes in the speeds of the processors and/or communication links. On the theoretical side, to the best of our knowledge, this is the first published result assessing the absolute performance of a multi-round algorithm. On the practical side, extensive simulations reveal that our multi-round algorithm outperforms existing solutions on a large variety of platforms, especially when the communication-to-computation ratio is not very high (the difficult case).

Introduction

Scheduling computational tasks on a given set of processors is a key issue for high-performance computing. In this paper, we restrict our attention to the processing of independent tasks whose size (and number) are a parameter of the scheduling algorithm. This corresponds to the divisible load model which has been widely studied in the last several years, and popularized by the landmark book written by Bharadwaj et al. [1]. A divisible job is a job that can be arbitrarily split in a linear fashion among any number of processors. This corresponds to a perfectly parallel job: any subtask can itself be processed in parallel, and on any number of processors. The applications of the divisible load model encompass a large spectrum of scientific problems, including among others Kalman filtering [2], image processing [3], video and multimedia broadcasting [4], [5], database searching [6], [7], and the processing of large distributed files [8] (see [1] for more examples).

On the practical side, the divisible load model provides a simple yet realistic framework to study the mapping of independent tasks on heterogeneous platforms. The granularity of the tasks can be arbitrarily chosen by the user, thereby providing a lot of flexibility in the implementation tradeoffs. On the theoretical side, the success of the divisible load model is mostly due to its analytical tractability. Optimal algorithms and closed-form formulas exist for the simplest instances of the divisible load problem. This is in sharp contrast with the theory of task graph scheduling, which abounds in NP completeness theorems [9], [10] and in inapproximability results [11], [12].

In this paper, the target computing platform is a heterogeneous master/worker platform, with p worker processes running on p processors labeled P1,P2,…,Pp. The master P0 sends out chunks to workers over a network: we can think of a star-shaped network, with the master in the center. The master uses its network connection in exclusive mode: it can communicate with a single worker at any time-step. There are different scenarios for the workers, depending on whether they can compute while receiving from the master (full overlap) or not. The overlap model is widely used in the literature, because it seems closer to the actual characteristics of state-of-the-art computing resources (but we point out that our results extend to both models, with and without overlap). For each communication of size αi between the master and a worker, say Pi, we pay a latency gi and a linear term αiGi, where Gi is the inverse of the bandwidth of the link between the master P0 and Pi. In the original model of [1], all the latencies gi are equal to zero, hence a linear cost model. However, latencies play an important role in current architectures [13], and more realistic models use the affine cost gi+αiGi for a message of size αi. Finally, note that when gi=g and Gi=G for 1⩽i⩽p, the star network can be viewed as a bus-oriented network [2].
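The cost model above can be made concrete with a short sketch. The code below is illustrative only: the worker parameters (g_i, G_i, w_i), the helper names, and the example values are hypothetical, not taken from the paper. It models a plain single-round distribution in which the master serves the workers one after another over its exclusive network connection.

```python
# Illustrative sketch of the affine communication cost and linear computation
# cost described above; parameter values are made up, not taken from the paper.

def comm_time(alpha, g, G):
    """Affine cost of a message of size alpha: latency g plus alpha * G,
    where G is the inverse of the link bandwidth."""
    return g + alpha * G

def comp_time(alpha, w):
    """Linear computation cost: w time units per unit of load."""
    return alpha * w

# A small heterogeneous star: one (g_i, G_i, w_i) triple per worker P_i.
workers = [(0.1, 0.5, 2.0), (0.2, 1.0, 1.0)]

def single_round_makespan(alphas):
    """Makespan when the master sends chunk alpha_i to worker P_i in order,
    using its network connection in exclusive mode; each worker starts
    computing as soon as its chunk has fully arrived."""
    t = 0.0       # time at which the master's link becomes free again
    finish = 0.0
    for alpha, (g, G, w) in zip(alphas, workers):
        t += comm_time(alpha, g, G)          # chunk received at time t
        finish = max(finish, t + comp_time(alpha, w))
    return finish
```

Note that in a single round the overlap question does not arise, since each worker receives exactly one chunk.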

The master processor can distribute the chunks to the workers in a single round (also called an installment in [1]), so that there will be a single communication between the master and each worker. This is the simplest situation, but surprisingly the optimal solution for a heterogeneous star network is not known, even for a linear cost model. We provide the optimal solution in Section 4, thereby extending the results of [2] for bus-oriented networks to heterogeneous platforms.
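For the linear cost model (all latencies gi = 0), the classical divisible-load argument, namely that in an optimal single-round schedule all selected workers finish at the same time, yields a simple closed-form recurrence for the chunk sizes. The sketch below illustrates this well-known property for a fixed ordering of the workers; it does not implement the paper's new results on worker ordering and resource selection, and the function name is hypothetical.

```python
def linear_single_round(W, workers):
    """Chunk sizes alpha_i for a single round under a linear cost model
    (no latencies), assuming all workers finish simultaneously.
    workers: list of (G_i, w_i) pairs, in the order the master serves them."""
    # Equal finish times give alpha_{i+1} = alpha_i * w_i / (G_{i+1} + w_{i+1}).
    ratios = [1.0]
    for (_, w_prev), (G_next, w_next) in zip(workers, workers[1:]):
        ratios.append(ratios[-1] * w_prev / (G_next + w_next))
    scale = W / sum(ratios)          # normalize so the chunks sum to W
    return [r * scale for r in ratios]
```

For instance, two workers with (G, w) = (0.5, 2.0) and (1.0, 1.0) and W = 1 each receive a chunk of size 0.5, and both finish at time 1.25.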

For large workloads, the single round approach is not efficient, because of the idle time incurred by the last processors to receive their chunks. To minimize the makespan, i.e. the total execution time, the master will send the chunks to the workers in multiple rounds: the communications will be shorter (less latency) and pipelined, and the workers will be able to compute the current chunk while receiving data for the next one. Deriving an efficient solution becomes a challenging problem: how many rounds should be scheduled? What is the best size of the chunks for each round? Intuitively, the size of the chunks should be small in the first rounds, so as to start all the workers as soon as possible. Then the chunk size should increase to a steady state value, to be determined so as to optimize the usage of the total available bandwidth of the network. Finally the chunk size should be decreased while reaching the end of the computation. In Chapter 10 of [1], there is no quantified value provided for the number of rounds to be used. Recently, Altilar and Paker [4], [5], and Yang and Casanova [14] have introduced multi-round algorithms and analytically expressed their performance. We discuss these algorithms, and others, in Section 3, which is devoted to related work. To the best of our knowledge, no optimality result has ever been obtained for multi-round algorithms on heterogeneous platforms. The most important result of this paper is to fill this gap: in Section 5, we design a periodic multi-round algorithm and we establish its asymptotic optimality. We succeed in extending this result to arbitrary platform graphs, i.e. not just star-shaped networks, but arbitrary graphs with cycles and multiple paths (see Appendix A).
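The ramp-up intuition above (small chunks first, then a steady-state size) can be sketched as follows. The growth factor and steady-state size are hypothetical parameters, not values derived in the paper; a real multi-round algorithm would compute them from the platform characteristics, and would also shrink the chunks near the end rather than simply truncating the last one.

```python
def ramp_up_chunks(W, first, growth, steady):
    """Illustrative chunk-size sequence: start small to get all workers busy
    quickly, grow geometrically up to a steady-state size, then keep that
    size until the workload W is exhausted (the last chunk is truncated).
    All parameters here are hypothetical, not taken from the paper."""
    chunks, size = [], first
    while W > 1e-12:
        c = min(size, steady, W)
        chunks.append(c)
        W -= c
        size *= growth
    return chunks
```

For example, `ramp_up_chunks(10.0, 1.0, 2.0, 4.0)` yields the sequence [1.0, 2.0, 4.0, 3.0].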

The rest of the paper is organized as follows. We begin with models for computation and communication costs in Section 2. Next we review related results in Section 3. Then we deal with single-round algorithms in Section 4. We proceed to multi-round algorithms in Section 5. Because of its technical nature, the extension of the asymptotically optimal multi-round algorithms to arbitrary platform graphs is described in Appendix A. We provide some simulations in Section 6. Finally, we state some concluding remarks in Section 7.

Section snippets

Models

As already said, we assume a total workload Wtotal that is perfectly divisible into an arbitrary number of pieces, or chunks. Usually, it is assumed that the master itself has no processing capability: otherwise, we can add a fictitious extra worker paying no communication cost to simulate the master. Linear cost models for computation are widely accepted in the literature: worker Pi will require αiwi time units to process a chunk of size αi. However, Yang and

Related results

We divide this overview into two categories: results for single-round algorithms, and results for multi-round algorithms. We restrict ourselves to master/worker platforms, which include bus-oriented and star-shaped networks. See [1] for results on processor trees and [6] for hypercubes.

New results for single-round algorithms

In this section, we propose a new proof method for the optimal distribution of the work to the processors in single-round algorithms. This approach enables us to retrieve some well known results, and to establish new ones.

The approach is based upon the comparison of the amount of work that is performed by the first two workers. To simplify notation, assume that P1 and P2 have been selected as the first two workers. There are two possible orderings, as illustrated in Fig. 2. For each ordering,

Asymptotically optimal multi-round algorithms

In this section, we derive asymptotically optimal algorithms for the multi-round distribution of divisible tasks, whether or not the worker processors are able to overlap their processing with incoming communications.
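The two scenarios can be contrasted with a minimal sketch of the time a worker spends per round on a chunk of size alpha (the helper names are hypothetical). With full overlap, the worker receives the next chunk while computing the current one, so in steady state a round costs the maximum of the two terms rather than their sum.

```python
def round_time_overlap(alpha, g, G, w):
    """Full overlap: communication of the next chunk is hidden behind the
    computation of the current one, so a steady-state round costs the max
    of the affine communication time and the linear computation time."""
    return max(g + alpha * G, alpha * w)

def round_time_no_overlap(alpha, g, G, w):
    """No overlap: the worker must first receive the chunk, then compute,
    so the two costs add up."""
    return (g + alpha * G) + alpha * w
```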

Simulations

In order to evaluate our multi-round algorithm, we have crafted a simulation with the SimGrid simulator [23], [24]. A major benefit of relying on SimGrid is that all machine and network characteristics used in the simulations correspond to realistic values taken from the SimGrid database. We detail below the platforms that we have simulated.

In the experiments, we let the total workload size Wtotal vary in terms of workload units (or tasks), whose number ranges from 100 to 2000 in steps of 100.

Conclusion

On the theoretical side, the main result of this paper is the proof of the asymptotic optimality of our multi-round algorithm. This is the first quantitative result ever assessed for a multi-round algorithm. But perhaps more importantly, our algorithm exhibits several interesting features that make it a candidate of choice in a wide variety of situations:

  • The best selection of the resources to be used among all available machines is automatically conducted through the linear program. Even

References (25)

  • C. Lee et al.

    Parallel image processing applications on a network of workstations

    Parallel Computing

    (1995)
  • J. Blazewicz et al.

Divisible task scheduling – concept and verification

    Parallel Computing

    (1999)
  • T. Hagerup

    Allocating independent tasks to parallel processors: an experimental study

    Journal of Parallel and Distributed Computing

    (1997)
  • V. Bharadwaj et al.

    Scheduling Divisible Loads in Parallel and Distributed Systems

    (1996)
  • J. Sohn et al.

    Optimizing computing costs using divisible load analysis

IEEE Transactions on Parallel and Distributed Systems

    (1998)
  • D. Altilar et al.

    An optimal scheduling algorithm for parallel video processing

  • D. Altilar et al.

    Optimal scheduling algorithms for communication constrained parallel processing

  • M. Drozdowski, Selected problems of scheduling tasks in multiprocessor computing systems, Ph.D. Thesis, Instytut...
  • R. Wang et al.

    Modeling communication pipeline latency

  • M.R. Garey et al.

Computers and Intractability: A Guide to the Theory of NP-Completeness

    (1991)
  • H. El-Rewini et al.

    Task Scheduling in Parallel and Distributed Systems

    (1994)

    A shorter version of this paper appears in the 2003 Heterogeneous Computing Workshop, IEEE Computer Society Press.
