Alternative, uniformly expressive and more scalable interfaces for collective communication in MPI☆
Highlights
► Significantly extends the expressivity of the MPI collectives. ► Contributes toward solving known scalability problems of irregular MPI collectives. ► Introduces an alternative set of datatype constructors, some new to MPI.
Introduction
It is well acknowledged that the MPI interfaces for the so-called irregular (or vector) collective operations [7, Chapter 5] (MPI_Alltoallw being the most extreme example) suffer from scalability problems: multiple argument lists of size p (the number of MPI processes in the communicator on which the operation is applied) have to be supplied in the calls, often in application contexts where a considerable amount of regularity and/or redundancy exists, e.g., a large fraction of count values of zero or semi-regular displacement values [2]. At the same time, despite their general appearance, the irregular collective interfaces pose certain limitations on which data placements can be specified, limiting their use in situations where a derived-datatype description of the data placements would be the natural choice. This observation previously motivated the generalization of the irregular MPI 1.0 MPI_Alltoallv operation into the additional MPI 2.0 MPI_Alltoallw operation, which in addition to counts and displacements takes lists of possibly different send and receive datatypes as arguments. Curiously, the same generalization was not undertaken for the other irregular collectives, one argument being that MPI_Alltoallw can express any possible exchange pattern. Although correct, this argument misses the point that much more efficient algorithmic realizations exist for the more specialized operations.
In the MPI design of the collective interfaces there is an intended, clear correspondence between the specification of send and receive buffers and certain derived (or user-defined) datatypes [7, Chapter 4]. For instance, the data received in the regular MPI_Gather operation are stored as a contiguous sequence of blocks, with the block from process i stored as the ith block. Each block is equivalent to a contiguous datatype of the given basetype, and the whole set of received blocks can again be described as a contiguous datatype of p such blocks. For the irregular MPI_Gatherv operation, the received data are placed as described by an indexed type with p indices of contiguous blocks of the same basetype. Exactly this correspondence with the indexed type limits the expressivity of MPI_Gatherv. For instance, interleaving (or tiling) the data received from the p processes is tedious because the indices of the indexed type define displacements in units of the extent of the basetype. To achieve the interleaving effect, a new, resized type with extent equal to the tile size would have to be created. Gathering data of different, unrelated basetypes is not possible at all with this collective operation.
This paper proposes to push the correspondence between the arguments to the collective operations and the MPI derived datatypes to its natural limit. We propose to replace all specifications of collective communication buffers that in MPI are in terms of base addresses, counts, displacements and basetypes with only a base address and a datatype. All structure, including the repetition count of the data to be sent and received, will be encapsulated in the datatype. This has the advantage of unifying the interfaces for regular and irregular collective operations, resulting in only one interface for each communication pattern and thereby reducing the overall number of collective interfaces in MPI from the current 16 to 10 (excluding MPI_Barrier). It gives full generality and similar expressivity to all collective operations, preserves the spirit of MPI of explicitly distinguishing between regular and irregular communication patterns, and can thus be implemented with the same, small overhead. Specifically, it delegates the interface scalability issues from the collective interfaces to the datatype interfaces (Section 3). To exploit this further we introduce a number of new, more space-economical datatype constructors for expressing semi-regular data distributions that we believe occur often in applications. We argue that these new datatypes do not introduce any new, fundamental implementation problems, and can all be implemented efficiently by known methods as in, e.g., [9] and many subsequent papers. This alleviates the interface scalability issue for the irregular MPI collective interfaces (Section 4). The decoupling of the data description from the collective interfaces makes it possible to perform certain algorithmic optimizations (finding regularities, compressing index and count arrays, etc.) in advance, and potentially to amortize the costs of such analyses over a number of collective calls. This provides an alternative approach to persistent collective operations.
It should be pointed out that the proposal does not change the communication patterns that can be expressed by the MPI collectives. It is orthogonal to proposals for sparse collective communication for MPI [5], that address other issues in collective communication. This and other issues and alternatives are discussed at length (Section 5).
Using an example, we demonstrate expressivity and scalability limitations of the current collective interfaces. The example furthermore illustrates where the derived-datatype mechanism not only serves expressive purposes but can actually delegate real work to the MPI library (Section 2). Three similar applications were recently discussed and benchmarked against implementations not using MPI datatypes in [1], and in some cases better performance was achieved with the datatype solution. An elegant example of using a derived datatype together with MPI_Alltoall communication to accomplish a Fast Fourier Transform (FFT) was recently given in [4], illustrating again the considerable, often unexploited potential of the MPI derived-datatype mechanism. Earlier work applied derived datatypes to the NAS parallel benchmarks and showed performance and convenience benefits [6].
Section snippets
A motivating example
As a concrete, motivating example, we consider an n × n matrix distributed blockwise over the p processes of some communicator (for simplicity, MPI_COMM_WORLD). The matrix and all submatrices are stored in arrays in row-major order. Each process has a submatrix block of the size shown in Fig. 1. Assume that at some point all processes need to gather the full n × n matrix; n could, for instance, have been reduced to something small enough that a full gather would make sense. First we assume that p is a
A concise set of collectives
The example illustrated the lack of generality of the MPI collective interfaces and the lack of datatype support for expressing regular patterns in connection with collective operations. The present proposal replaces the collective interfaces with new interfaces in which the data layout, including counts and displacements, is described solely by a derived-datatype argument. This completely separates the data description from the collective communication or reduction pattern, and reduces the current
The new datatype constructors
From now on we will assume a more intimate familiarity with the derived-datatype mechanism of MPI (see [7, Chapter 4]). We describe a different, more space-economical set of datatype constructors that fits better with the new collective interfaces introduced in the previous section. Some of these type constructors would be beneficial for MPI as it stands.
An MPJ datatype describes a layout of basic types (integers, characters, doubles) in memory by its associated type map. The type map is an ordered
Discussion
Although intended as a serious contribution to the discussion of how to make the irregular collective interfaces of MPI more scalable, and more generally of how to enable scalable application programming with collective communication operations, the proposal as outlined here is not expected to be adopted for MPI 3.0. Nevertheless, some of the ideas of additional, more space-economical datatype constructors (MPJ_Type_create_bucket_index and MPJ_Type_create_segment), and the idea of distributions
Summary and outlook
We presented a reduced but more expressive set of collective interfaces that fully separates the collective communication pattern from the local data layout. This separation alleviates (not: solves) the interface scalability problems of the current MPI collectives by delegating the specification of the data distribution entirely to the datatypes. The proposal was complemented by alternative datatype constructors that capture regular patterns currently not expressible in a concise, space economic and
Acknowledgments
The author thanks Rajeev Thakur and George Bosilca for insightful criticism on a first version of this paper, in particular for pointing out some too strong claims on what is not possible with MPI. Robert van de Geijn, Jack Poulson, Robert M. Kirby and George Bosilca are thanked for comments on the (non-)use of MPI datatypes in numerical libraries. The proposal has evolved in detail since submission, and has benefited from discussions with Torsten Hoefler, Christian Siebert and Fab Tillier. An
References (9)
- et al., Using MPI derived datatypes in numerical libraries
- et al., MPI on millions of cores, Parallel Processing Letters (2011)
- et al., Performance expectations and guidelines for MPI derived datatypes: a first analysis
- et al., Parallel zero-copy algorithms for fast Fourier transform and conjugate gradient using MPI datatypes
☆ This is a substantially revised version of the paper “A (radical) proposal addressing the non-scalability of the irregular MPI collective interfaces”, presented at the 16th Workshop on High-level Parallel Programming Models and Supportive Environments (HIPS) at the 25th International Parallel and Distributed Processing Symposium (IPDPS), 2011.