Abstract
The increasing popularity of multi-core processors has made MPI intra-node communication, including intra-node RMA (Remote Memory Access) communication, a critical component in high-performance computing. The MPI-2 RMA model includes one-sided data transfer and synchronization operations. However, existing designs in widely used MPI stacks do not provide truly one-sided intra-node RMA communication: they are built on top of two-sided send-receive operations and therefore suffer from two-sided communication overheads and a dependency on the remote side. In this paper, we enhance existing shared-memory mechanisms to design truly one-sided synchronization. In addition, we design truly one-sided intra-node data transfer using two kernel-based direct-copy alternatives: a basic kernel-assisted approach and an I/OAT-assisted approach. The new design eliminates both the overhead of two-sided operations and the involvement of the remote side. We also propose a series of benchmarks to evaluate various performance aspects on multi-core architectures (Intel Clovertown, Intel Nehalem and AMD Barcelona). The results show that the new design achieves up to 39% lower latency for small and medium messages and up to 29% higher bandwidth for large messages. Moreover, it provides better scalability, fewer cache misses, higher resilience to process skew and increased computation/communication overlap. Finally, up to 10% performance benefit is demonstrated for a real scientific application, AWM-Olsen.
Additional information
This research is supported in part by DOE grants #DE-FC02-06ER25749 and #DE-FC02-06ER25755; NSF grants #CNS-0403342, #CCF-0702675, #CCF-0833169, #CCF-0916302 and #OCI-0926691; grants from Intel, Mellanox, Cisco systems, QLogic and Sun Microsystems; and equipment donations from Intel, Mellanox, AMD, Appro, Chelsio, Dell, Fujitsu, Fulcrum, Microway, Obsidian, QLogic, and Sun Microsystems.
Lai, P., Sur, S. & Panda, D.K. Designing truly one-sided MPI-2 RMA intra-node communication on multi-core systems. Comput Sci Res Dev 25, 3–14 (2010). https://doi.org/10.1007/s00450-010-0115-3