
Hierarchical Optimization of MPI Reduce Algorithms

Conference paper

Parallel Computing Technologies (PaCT 2015)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9251)

Abstract

Optimization of MPI collective communication operations has been an active research topic since the advent of MPI in the 1990s. Many general and architecture-specific collective algorithms have been proposed and implemented in state-of-the-art MPI implementations. Hierarchical, topology-oblivious transformation of existing communication algorithms has recently been proposed as a promising new approach to the optimization of MPI collective communication algorithms and MPI-based applications. It has been successfully applied to the most popular parallel matrix multiplication algorithm, SUMMA, and to state-of-the-art MPI broadcast algorithms, demonstrating significant multi-fold performance gains, especially on large-scale HPC systems. In this paper, we apply this approach to the optimization of the MPI reduce operation. Theoretical analysis and experimental results on a cluster of the Grid'5000 platform are presented.
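
To give a flavour of where such gains come from, consider an illustrative calculation (a sketch under the Hockney model, not reproduced from the paper's analysis): with latency α and per-byte transfer time β, a flat-tree reduce of an m-byte message over p processes costs roughly (p - 1)(α + mβ). Splitting the p processes into G groups, reducing within each group to a leader, and then reducing across the G leaders costs about (G - 1 + p/G - 1)(α + mβ). This is minimised at G = √p, giving roughly 2(√p - 1)(α + mβ), an asymptotic improvement from O(p) to O(√p) obtained without any knowledge of the network topology.

The two-phase structure itself is straightforward to express with standard MPI calls. The following is a minimal sketch of a two-level hierarchical reduce, not the paper's implementation: the group count num_groups and the rank-to-group mapping are arbitrary choices made for the example.

/* Illustrative two-level hierarchical reduce: a minimal sketch, not the
 * paper's algorithm. Processes are split into num_groups groups; each
 * group reduces to a local leader, then leaders reduce to the root. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int num_groups = 4;             /* hypothetical group count */
    int group = rank % num_groups;  /* hypothetical rank-to-group mapping */

    /* Intra-group communicator: all processes with the same colour. */
    MPI_Comm group_comm;
    MPI_Comm_split(MPI_COMM_WORLD, group, rank, &group_comm);

    int group_rank;
    MPI_Comm_rank(group_comm, &group_rank);

    /* Leader communicator: only rank 0 of each group participates;
     * everyone else passes MPI_UNDEFINED and gets MPI_COMM_NULL. */
    MPI_Comm leader_comm;
    MPI_Comm_split(MPI_COMM_WORLD,
                   group_rank == 0 ? 0 : MPI_UNDEFINED, rank, &leader_comm);

    double local = (double)rank;    /* example payload */
    double group_sum = 0.0, total = 0.0;

    /* Phase 1: reduce within each group to the group leader. */
    MPI_Reduce(&local, &group_sum, 1, MPI_DOUBLE, MPI_SUM, 0, group_comm);

    /* Phase 2: reduce the partial results across the group leaders. */
    if (leader_comm != MPI_COMM_NULL) {
        MPI_Reduce(&group_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0,
                   leader_comm);
        if (rank == 0)
            printf("hierarchical sum = %f\n", total);
        MPI_Comm_free(&leader_comm);
    }

    MPI_Comm_free(&group_comm);
    MPI_Finalize();
    return 0;
}

Because MPI_Comm_split orders ranks by its key argument, the leader of group 0 is global rank 0, so the final result lands at the global root. In a tuned implementation the number of groups would be derived from a communication model rather than fixed.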

Acknowledgments

This work has emanated from research conducted with the financial support of IRCSET (Irish Research Council for Science, Engineering and Technology) and IBM, grant number EPSPG/2011/188, and Science Foundation Ireland, grant number 08/IN.1/I2054.

The experiments presented in this publication were carried out using the Grid'5000 experimental testbed, being developed under the INRIA ALADDIN development action with support from CNRS, RENATER and several universities, as well as other funding bodies (see https://www.grid5000.fr).

Author information

Correspondence to Khalid Hasanov.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Hasanov, K., Lastovetsky, A. (2015). Hierarchical Optimization of MPI Reduce Algorithms. In: Malyshkin, V. (ed.) Parallel Computing Technologies. PaCT 2015. Lecture Notes in Computer Science, vol. 9251. Springer, Cham. https://doi.org/10.1007/978-3-319-21909-7_3

  • DOI: https://doi.org/10.1007/978-3-319-21909-7_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-21908-0

  • Online ISBN: 978-3-319-21909-7

  • eBook Packages: Computer Science (R0)
