ABSTRACT
The Argonne Leadership Computing Facility has operated IBM Blue Gene/L, /P, and /Q series supercomputers for over a decade. This paper discusses the lessons the authors learned from the Blue Gene architecture that are generally applicable to the design and operation of large high performance computing systems.
Index Terms
- Lessons from the IBM Blue Gene Series of Supercomputers