DOI: 10.1145/2304576.2304605
research-article

Composable, non-blocking collective operations on Power7 IH

Published: 25 June 2012

ABSTRACT

The Power7 IH (P7IH) is one of IBM's latest-generation supercomputers. Like most modern parallel machines, it has a hierarchical organization: simultaneous multithreading (SMT) within a core, multiple cores per processor, multiple processors per node (SMP), and multiple SMP nodes per cluster. A low-latency, high-bandwidth network with specialized accelerators interconnects the SMP nodes. System software is tuned to exploit this hierarchical organization.

In this paper we present a novel set of collective operations that take advantage of the P7IH hardware. We discuss non-blocking collective operations implemented using point-to-point messages, shared memory, and accelerator hardware. We show how collectives can be composed to exploit the hierarchical organization of the P7IH and provide low-latency, high-bandwidth operations. We demonstrate the scalability of our collectives with experimental results on a P7IH system with up to 4096 cores.
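The idea of composing non-blocking collectives along the machine hierarchy can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: it does not use PAMI, LAPI, or any P7IH hardware, and the names `NonBlockingOp`, `compose`, and `hierarchical_allreduce` are hypothetical, not taken from the paper. It models an allreduce (sum) as three stages (reduce within each node, allreduce among node leaders, broadcast within each node) and exposes the whole chain through a handle that the caller advances one stage at a time.

```python
from typing import Callable, Iterator, List


class NonBlockingOp:
    """Hypothetical handle: the collective advances one stage per progress() call."""

    def __init__(self, phases: Iterator[None]):
        self._phases = phases
        self.done = False

    def progress(self) -> bool:
        """Advance one stage; return True once the whole operation has completed."""
        if not self.done:
            try:
                next(self._phases)
            except StopIteration:
                self.done = True
        return self.done


def compose(*stages: Callable[[], Iterator[None]]) -> Iterator[None]:
    """Chain stage generators so they run back to back as a single collective."""
    for stage in stages:
        yield from stage()


def hierarchical_allreduce(values: List[int], cores_per_node: int) -> NonBlockingOp:
    """Sum `values` (one entry per simulated rank) in place, staging through nodes."""
    # Assumption: ranks are assigned to nodes in contiguous blocks.
    nodes = [range(i, min(i + cores_per_node, len(values)))
             for i in range(0, len(values), cores_per_node)]
    leaders = [node[0] for node in nodes]

    def local_reduce() -> Iterator[None]:
        # Stage 1: each node combines its ranks' values at the node leader
        # (stands in for a shared-memory reduction).
        for node in nodes:
            values[node[0]] = sum(values[r] for r in node)
        yield

    def leader_allreduce() -> Iterator[None]:
        # Stage 2: node leaders exchange partial sums
        # (stands in for the inter-node network step).
        total = sum(values[l] for l in leaders)
        for l in leaders:
            values[l] = total
        yield

    def local_broadcast() -> Iterator[None]:
        # Stage 3: each leader shares the result with the ranks on its node.
        for node in nodes:
            for r in node:
                values[r] = values[node[0]]
        yield

    return NonBlockingOp(compose(local_reduce, leader_allreduce, local_broadcast))


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7, 8]
    op = hierarchical_allreduce(data, cores_per_node=4)
    while not op.progress():
        pass  # a real caller would overlap independent computation here
    print(data)
```

Because the caller drives `progress()`, independent work can be interleaved between stages, which is the overlap that non-blocking collectives are meant to allow. A real implementation would run the stages concurrently across ranks rather than in one loop.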

        Published in

        ICS '12: Proceedings of the 26th ACM international conference on Supercomputing
        June 2012
        400 pages
        ISBN: 9781450313162
        DOI: 10.1145/2304576

        Copyright © 2012 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 584 of 2,055 submissions, 28%