Alternative, uniformly expressive and more scalable interfaces for collective communication in MPI☆
Highlights
► Significantly extends the expressivity of the MPI collectives. ► Contributes toward solving known scalability problems of irregular MPI collectives. ► Introduces an alternative set of datatype constructors, some new to MPI.
Introduction
It is well acknowledged that the MPI interfaces for the so-called irregular (or vector) collective operations [7, Chapter 5] (MPI_Alltoallw being the most extreme example) suffer from scalability problems: multiple argument lists of size p (the number of MPI processes in the communicator on which the operation is applied) have to be supplied in the calls, often in application contexts where a considerable amount of regularity and/or redundancy exists, e.g., a large fraction of count values of zero or semi-regular displacement values [2]. At the same time, despite their general appearance, the irregular collective interfaces pose certain limitations on which data placements can be specified, limiting their use in situations where a derived-datatype description of the data placements would be the natural choice. This observation previously motivated the generalization of the irregular MPI 1.0 MPI_Alltoallv operation into the additional MPI 2.0 MPI_Alltoallw operation, which in addition to counts and displacements takes lists of possibly different send and receive datatypes as arguments. Curiously, the same generalization was not undertaken for the other irregular collectives, one argument being that MPI_Alltoallw can express any possible exchange pattern. Although correct, this argument misses the point that much more efficient algorithmic realizations exist for the more specialized operations.
In the MPI design of the collective interfaces there is an intended, clear correspondence between the specification of send and receive buffers and certain derived (or user-defined) datatypes [7, Chapter 4]. For instance, the data received in the regular MPI_Gather operation are stored as a contiguous sequence of blocks, with the block from process i stored as the ith block. Each block is equivalent to a contiguous datatype of the given basetype, and the whole set of received blocks can again be described as a contiguous datatype of p such blocks. For the irregular MPI_Gatherv operation, the received data are placed as described by an indexed type with p indices of contiguous blocks of the same basetype. Exactly this correspondence with the indexed type limits the expressivity of MPI_Gatherv. For instance, interleaving (or tiling) the data received from the p processes is tedious because the indices of the indexed type define displacements in units of the extent of the basetype. To achieve the interleaving effect, a new, resized type with extent equal to the tile size would have to be created. Gathering data of different, unrelated basetypes is not possible at all with this collective operation.
This paper proposes to push the correspondence between the arguments to the collective operations and the MPI derived datatypes to its natural limit. We propose to replace all specifications of collective communication buffers that in MPI are in terms of base addresses, counts, displacements and basetypes with only a base address and a datatype. All structure, including the repetition count of the data to be sent and received, will be encapsulated in the datatype. This has the advantage of unifying the interfaces for regular and irregular collective operations, resulting in only one interface for each communication pattern and thereby reducing the overall number of collective interfaces in MPI from the current 16 to 10 (excluding MPI_Barrier). It gives full generality and similar expressivity to all collective operations, preserves the spirit of MPI of explicitly distinguishing between regular and irregular communication patterns, and can thus be implemented with the same, small overhead. Specifically, it delegates the interface scalability issues from the collective interfaces to the datatype interfaces (Section 3). To exploit this further we introduce a number of new, more space-economical datatype constructors for expressing semi-regular data distributions that we believe occur often in applications. We argue that these new datatypes do not introduce any new, fundamental implementation problems, and can all be implemented efficiently by known methods as in, e.g., [9] and many subsequent papers. This alleviates the interface scalability issue for the irregular MPI collective interfaces (Section 4). The decoupling of the data description from the collective interfaces makes it possible to perform certain algorithmic optimizations (finding regularities, compressing index and count arrays, etc.) in advance, and potentially to amortize the costs of such analyses over a number of collective calls. This provides an alternative approach to persistent collective operations.
It should be pointed out that the proposal does not change the communication patterns that can be expressed by the MPI collectives. It is orthogonal to proposals for sparse collective communication for MPI [5], that address other issues in collective communication. This and other issues and alternatives are discussed at length (Section 5).
Using an example, we demonstrate expressivity and scalability limitations of the current collective interfaces. The example furthermore illustrates where the derived-datatype mechanism not only serves expressive purposes but can actually delegate real work to the MPI library (Section 2). Three similar applications were recently discussed and benchmarked against implementations not using MPI datatypes in [1], and in some cases better performance was achieved with the datatype solution. An elegant example of using a derived datatype together with MPI_Alltoall communication to accomplish a Fast Fourier Transform (FFT) was recently given in [4], illustrating again the considerable, often unexploited potential of the MPI derived-datatype mechanism. Earlier work applied derived datatypes to the NAS parallel benchmarks and showed performance and convenience benefits [6].
Section snippets
A motivating example
As a concrete, motivating example, we consider an n × n matrix distributed blockwise over the p processes of some communicator (for simplicity, MPI_COMM_WORLD). The matrix and all submatrices are stored in arrays in row-major order. Each process has a submatrix block of the size shown in Fig. 1. Assume that at some point all processes need to gather the full n × n matrix; n could, for instance, have been reduced to something small enough that a full gather would make sense. First we assume that p is a
A concise set of collectives
The example illustrated the lack of generality of the MPI collective interfaces and the lack of datatype support for expressing regular patterns in connection with collective operations. The present proposal replaces the collective interfaces with new interfaces in which the data layout, including counts and displacements, is described solely by a derived-datatype argument. This completely separates the data description from the collective communication or reduction pattern, and reduces the current
The new datatype constructors
From now on we will assume a more intimate familiarity with the derived-datatype mechanism of MPI (see [7, Chapter 4]). We describe a different, more space-economical set of datatype constructors that fits better with the new collective interfaces introduced in the previous section. Some of these type constructors would be beneficial for MPI as it stands.
An MPJ datatype describes a layout of basic types (integers, characters, doubles) in memory by its associated type map. The type map is an ordered
Discussion
Although intended as a serious contribution to the discussion of how to make the irregular collective interfaces of MPI more scalable, and more generally of how to enable scalable application programming with collective communication operations, the proposal as outlined here is not expected to be adopted for MPI 3.0. Nevertheless, some of the ideas of additional, more space-economical datatype constructors (MPJ_Type_create_bucket_index and MPJ_Type_create_segment), and the idea of distributions
Summary and outlook
We presented a reduced but more expressive set of collective interfaces that fully separates the collective communication pattern from the local data layout. This separation alleviates (not: solves) the interface scalability problems of the current MPI collectives by delegating the specification of the data distribution entirely to the datatypes. The proposal was complemented by alternative datatype constructors that capture regular patterns currently not expressible in a concise, space economic and
Acknowledgments
The author thanks Rajeev Thakur and George Bosilca for insightful criticism on a first version of this paper, in particular for pointing out some too strong claims on what is not possible with MPI. Robert van de Geijn, Jack Poulson, Robert M. Kirby and George Bosilca are thanked for comments on the (non-)use of MPI datatypes in numerical libraries. The proposal has evolved in detail since submission, and has benefited from discussions with Torsten Hoefler, Christian Siebert and Fab Tillier. An
References (9)
- et al., Using MPI derived datatypes in numerical libraries
- et al., MPI on millions of cores, Parallel Processing Letters (2011)
- et al., Performance expectations and guidelines for MPI derived datatypes: a first analysis
- et al., Parallel zero-copy algorithms for fast Fourier transform and conjugate gradient using MPI datatypes
☆ This is a substantially revised version of the paper “A (radical) proposal addressing the non-scalability of the irregular MPI collective interfaces”, presented at the 16th Workshop on High-level Parallel Programming Models and Supportive Environments (HIPS) at the 25th International Parallel and Distributed Processing Symposium (IPDPS), 2011.