ABSTRACT
The Power7 IH (P7IH) is one of IBM's latest-generation supercomputers. Like most modern parallel machines, it has a hierarchical organization: simultaneous multithreading (SMT) within a core, multiple cores per processor, multiple processors per node (SMP), and multiple SMP nodes per cluster. A low-latency, high-bandwidth network with specialized accelerators interconnects the SMP nodes, and the system software is tuned to exploit this hierarchical organization.
In this paper we present a novel set of collective operations that take advantage of the P7IH hardware. We discuss non-blocking collective operations implemented using point-to-point messages, shared memory, and accelerator hardware. We show how collectives can be composed to exploit the hierarchical organization of the P7IH, providing low-latency, high-bandwidth operations. We demonstrate the scalability of the collectives we designed with experimental results on a P7IH system with up to 4096 cores.
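The hierarchical composition the abstract describes can be illustrated with a minimal sketch: an allreduce built from an intra-node reduction to a per-node leader (shared memory on the P7IH), an inter-node exchange among leaders (the network), and an intra-node broadcast back to every rank. This is a plain-Python simulation of the composition pattern, not the paper's implementation; the function name and phase structure are illustrative assumptions.

```python
# Hedged sketch of a hierarchy-aware allreduce, assuming ranks are grouped
# into equal-sized nodes. Not the P7IH implementation; purely illustrative.

def hierarchical_allreduce(values, ranks_per_node):
    """Return the global sum as seen by every rank, computed in three phases."""
    # Phase 1: intra-node reduce to a per-node leader
    # (on the P7IH this phase would use shared memory within an SMP node).
    nodes = [values[i:i + ranks_per_node]
             for i in range(0, len(values), ranks_per_node)]
    leader_sums = [sum(node) for node in nodes]

    # Phase 2: inter-node allreduce among the leaders
    # (on the P7IH this phase would cross the cluster network).
    total = sum(leader_sums)

    # Phase 3: intra-node broadcast of the result back to every rank.
    return [total] * len(values)


# Eight ranks on two four-rank nodes: every rank observes the global sum 36.
result = hierarchical_allreduce([1, 2, 3, 4, 5, 6, 7, 8], ranks_per_node=4)
```

In a real non-blocking implementation each phase would be initiated and later completed via a test/wait call, so that computation can overlap the inter-node exchange; the simulation above only captures the composition of the phases.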
Index Terms
- Composable, non-blocking collective operations on Power7 IH