Abstract
We present improved algorithms for global reduction operations for message-passing systems. Each of p processors has a vector of m data items, and we want to compute the element-wise “sum” under a given, associative function of the p vectors. The result, which is also a vector of m items, is to be stored at either a given root processor (MPI_Reduce), or all p processors (MPI_Allreduce). A further constraint is that for each data item and each processor the result must be computed in the same order, and with the same bracketing. Both problems can be solved in O(m + log₂ p) communication and computation time. Such reduction operations are part of MPI (the Message Passing Interface), and the algorithms presented here achieve significant improvements over currently implemented algorithms for the important case where p is not a power of 2. Our algorithm requires ⌈log₂ p⌉ + 1 rounds, one round off from optimal, for small vectors. For large vectors twice the number of rounds is needed, but the communication and computation time is less than 3mβ and (3/2)mγ, respectively, an improvement from 4mβ and 2mγ achieved by previous algorithms (with the message transfer time modeled as α + mβ, and reduction-operation execution time as mγ). For p = 3 × 2ⁿ and p = 9 × 2ⁿ and small m ≤ b for some threshold b, and p = q × 2ⁿ with small q, our algorithm achieves the optimal ⌈log₂ p⌉ number of rounds.
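The figures quoted in the abstract can be tabulated with a short Python sketch. It is only an illustration of the stated bounds, not the paper's algorithm: the function names (`ceil_log2`, `small_vector_rounds`, `large_vector_rounds`, `bandwidth_and_compute_bounds`) are hypothetical, and the sample values of p, m, β and γ are arbitrary assumptions chosen for readability.

```python
def ceil_log2(p: int) -> int:
    """Return ceil(log2 p) for p >= 1 using exact integer arithmetic."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return (p - 1).bit_length()


def small_vector_rounds(p: int) -> int:
    """Round count quoted for small vectors: ceil(log2 p) + 1."""
    return ceil_log2(p) + 1


def large_vector_rounds(p: int) -> int:
    """Round count quoted for large vectors: twice the small-vector count."""
    return 2 * small_vector_rounds(p)


def bandwidth_and_compute_bounds(m: float, beta: float, gamma: float):
    """Return (new, previous) upper bounds on transfer plus reduction time.

    new      = 3*m*beta + (3/2)*m*gamma   (bound quoted for the new algorithm)
    previous = 4*m*beta + 2*m*gamma       (figure quoted for previous algorithms)
    The latency term (rounds times alpha) is deliberately left out here.
    """
    new = 3 * m * beta + 1.5 * m * gamma
    previous = 4 * m * beta + 2 * m * gamma
    return new, previous


if __name__ == "__main__":
    p = 6  # assumed example: a processor count that is not a power of 2
    print("small-vector rounds:", small_vector_rounds(p))
    print("large-vector rounds:", large_vector_rounds(p))
    new, previous = bandwidth_and_compute_bounds(m=1000, beta=1.0, gamma=1.0)
    print("new bound:", new, "previous:", previous)
```

Under this model the ratio of the two bounds is 3/4 whatever the values of m, β and γ, since both the βm and γm terms shrink by the same factor.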
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Rabenseifner, R., Träff, J.L. (2004). More Efficient Reduction Algorithms for Non-Power-of-Two Number of Processors in Message-Passing Parallel Systems. In: Kranzlmüller, D., Kacsuk, P., Dongarra, J. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2004. Lecture Notes in Computer Science, vol 3241. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30218-6_13
DOI: https://doi.org/10.1007/978-3-540-30218-6_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23163-9
Online ISBN: 978-3-540-30218-6