ABSTRACT
The Power7 IH (P7IH) is one of IBM's latest-generation supercomputers. Like most modern parallel machines, it has a hierarchical organization: simultaneous multithreading (SMT) within a core, multiple cores per processor, multiple processors per node (SMP), and multiple SMP nodes per cluster. A low-latency, high-bandwidth network with specialized accelerators interconnects the SMP nodes, and the system software is tuned to exploit this hierarchical organization.
In this paper we present a novel set of collective operations that take advantage of the P7IH hardware. We discuss non-blocking collective operations implemented using point-to-point messages, shared memory, and accelerator hardware. We show how collectives can be composed to exploit the hierarchical organization of the P7IH, providing low-latency, high-bandwidth operations. We demonstrate the scalability of the collectives we designed with experimental results on a P7IH system with up to 4096 cores.
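The hierarchical composition the abstract describes can be illustrated with a minimal sketch: an allreduce built from an intra-node reduction to a per-node leader (shared memory on the P7IH), an inter-node exchange among leaders (the network), and an intra-node broadcast back to every rank. This is a plain-Python simulation of the composition pattern, not the paper's implementation; the function name and phase structure are illustrative assumptions.

```python
# Hedged sketch of a hierarchy-aware allreduce, assuming ranks are grouped
# into equal-sized nodes. Not the P7IH implementation; purely illustrative.

def hierarchical_allreduce(values, ranks_per_node):
    """Return the global sum as seen by every rank, computed in three phases."""
    # Phase 1: intra-node reduce to a per-node leader
    # (on the P7IH this phase would use shared memory within an SMP node).
    nodes = [values[i:i + ranks_per_node]
             for i in range(0, len(values), ranks_per_node)]
    leader_sums = [sum(node) for node in nodes]

    # Phase 2: inter-node allreduce among the leaders
    # (on the P7IH this phase would cross the cluster network).
    total = sum(leader_sums)

    # Phase 3: intra-node broadcast of the result back to every rank.
    return [total] * len(values)


# Eight ranks on two four-rank nodes: every rank observes the global sum 36.
result = hierarchical_allreduce([1, 2, 3, 4, 5, 6, 7, 8], ranks_per_node=4)
```

In a real non-blocking implementation each phase would be initiated and later completed via a test/wait call, so that computation can overlap the inter-node exchange; the simulation above only captures the composition of the phases.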
Index Terms
- Composable, non-blocking collective operations on Power7 IH