More Efficient Reduction Algorithms for Non-Power-of-Two Number of Processors in Message-Passing Parallel Systems

Rabenseifner, Rolf; Träff, Jesper Larsson

doi:10.1007/978-3-540-30218-6_13

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3241)

Included in the conference series: European Parallel Virtual Machine / Message Passing Interface Users’ Group Meeting

Abstract

We present improved algorithms for global reduction operations in message-passing systems. Each of p processors holds a vector of m data items, and the goal is to compute the element-wise “sum” of the p vectors under a given associative function. The result, also a vector of m items, is to be stored either at a given root processor (MPI_Reduce) or at all p processors (MPI_Allreduce). A further constraint is that, for each data item and each processor, the result must be computed in the same order and with the same bracketing. Both problems can be solved in O(m + log₂ p) communication and computation time. Such reduction operations are part of MPI (the Message Passing Interface), and the algorithms presented here achieve significant improvements over currently implemented algorithms for the important case where p is not a power of 2. For small vectors, our algorithm requires ⌈log₂ p⌉ + 1 rounds, one round more than optimal. For large vectors, twice the number of rounds is needed, but the communication and computation times are less than 3mβ and (3/2)mγ, respectively, an improvement over the 4mβ and 2mγ achieved by previous algorithms (with the message transfer time modeled as α + mβ and the reduction-operation execution time as mγ). For p = 3 × 2ⁿ and p = 9 × 2ⁿ with small m ≤ b for some threshold b, and for p = q·2ⁿ with small q, our algorithm achieves the optimal number of rounds, ⌈log₂ p⌉.
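
The bounds quoted in the abstract can be made concrete with a small calculation. The following is only an illustrative sketch of the cost model (transfer time α + mβ, reduction time mγ), not the authors' algorithm. The function names and the sample values for p, m, β and γ are assumptions chosen for this example. The latency (α) term for large vectors is left out because the abstract does not state the exact round count for that case.

```python
def ceil_log2(p: int) -> int:
    """Return ceil(log2(p)) for a positive integer p, using exact integer arithmetic."""
    if p < 1:
        raise ValueError("p must be a positive integer")
    return (p - 1).bit_length()


def small_vector_rounds(p: int) -> int:
    """Round count stated in the abstract for small vectors: ceil(log2 p) + 1."""
    return ceil_log2(p) + 1


def large_vector_bounds(m: int, beta: float, gamma: float) -> dict:
    """Upper bounds on the bandwidth and compute terms quoted in the abstract.

    'new_*' are the bounds claimed for the presented algorithms (3*m*beta and
    1.5*m*gamma); 'old_*' are the values attributed to previous algorithms
    (4*m*beta and 2*m*gamma). Latency terms are deliberately omitted.
    """
    return {
        "new_comm": 3 * m * beta,
        "new_comp": 1.5 * m * gamma,
        "old_comm": 4 * m * beta,
        "old_comp": 2 * m * gamma,
    }


if __name__ == "__main__":
    p = 6  # hypothetical processor count that is not a power of two
    print("optimal rounds:", ceil_log2(p))
    print("small-vector rounds:", small_vector_rounds(p))
    # Hypothetical per-item transfer and reduction costs, in arbitrary time units.
    print(large_vector_bounds(m=8, beta=1.0, gamma=1.0))
```

Under this model the bandwidth bound drops from 4mβ to 3mβ and the computation bound from 2mγ to (3/2)mγ, a 25% reduction in each term regardless of the particular values of m, β and γ.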

Author information

Authors and Affiliations

High-Performance Computing-Center (HLRS), University of Stuttgart, Allmandring 30, D-70550, Stuttgart, Germany
Rolf Rabenseifner
C&C Research Laboratories, NEC Europe Ltd, Rathausallee 10, D-53757, Sankt Augustin, Germany
Jesper Larsson Träff

Editor information

Editors and Affiliations

GUP, Institute of Graphics and Parallel Processing, Johannes Kepler University, Altenbergerstraße 69, A-4040, Linz, Austria
Dieter Kranzlmüller
MTA SZTAKI, Computer and Automation Research Institute, Hungarian Academy of Sciences, P.O. Box 63, H-1518, Hungary
Péter Kacsuk
Computer Science Department, University of Tennessee, TN 37996-3450, Knoxville, USA
Jack Dongarra

Cite this paper

Rabenseifner, R., Träff, J.L. (2004). More Efficient Reduction Algorithms for Non-Power-of-Two Number of Processors in Message-Passing Parallel Systems. In: Kranzlmüller, D., Kacsuk, P., Dongarra, J. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2004. Lecture Notes in Computer Science, vol 3241. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30218-6_13

DOI: https://doi.org/10.1007/978-3-540-30218-6_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23163-9
Online ISBN: 978-3-540-30218-6
eBook Packages: Springer Book Archive
