DOI: 10.1145/3155105.3155111

Lessons from the IBM Blue Gene Series of Supercomputers

Published: 12 November 2017

ABSTRACT

The Argonne Leadership Computing Facility has operated IBM Blue Gene/L, /P, and /Q series supercomputers for over a decade. This paper discusses the lessons the authors learned from the Blue Gene architecture that are generally applicable to the design and operation of large high performance computing systems.

        Published in

        HPCSYSPROS'17: Proceedings of the HPC Systems Professionals Workshop
        November 2017
        47 pages
        ISBN: 9781450351287
        DOI: 10.1145/3155105

        Copyright © 2017 ACM

        Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Qualifiers

        • research-article
        • Research
        • Refereed limited

        Acceptance Rates

        HPCSYSPROS'17 Paper Acceptance Rate: 6 of 8 submissions, 75%
        Overall Acceptance Rate: 6 of 8 submissions, 75%
