ABSTRACT
Historically, improvements in GPU-based high performance computing have been tightly coupled to transistor scaling. As Moore's law slows down, and the number of transistors per die no longer grows at historical rates, the performance curve of single monolithic GPUs will ultimately plateau. However, the need for higher-performing GPUs continues to exist in many domains. To address this need, in this paper we demonstrate that package-level integration of multiple GPU modules to build larger logical GPUs can enable continuous performance scaling beyond Moore's law. Specifically, we propose partitioning GPUs into easily manufacturable basic GPU Modules (GPMs), and integrating them on package using high bandwidth and power efficient signaling technologies. We lay out the details and evaluate the feasibility of a basic Multi-Chip-Module GPU (MCM-GPU) design. We then propose three architectural optimizations that significantly improve GPM data locality and minimize the sensitivity to inter-GPM bandwidth. Our evaluation shows that the optimized MCM-GPU achieves a 22.8% speedup and 5x inter-GPM bandwidth reduction when compared to the basic MCM-GPU architecture. Most importantly, the optimized MCM-GPU design is 45.5% faster than the largest implementable monolithic GPU, and performs within 10% of a hypothetical (and unbuildable) monolithic GPU. Lastly, we show that our optimized MCM-GPU is 26.8% faster than an equally equipped multi-GPU system with the same total number of SMs and DRAM bandwidth.
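The relative speedups quoted above can be chained onto a common baseline to compare all configurations at once. The sketch below is a toy normalization, not the paper's evaluation methodology; it assumes (an assumption of this example, not a claim from the paper) that each percentage is a relative speedup over the same workload suite, so the figures compose multiplicatively.

```python
# Toy model: normalize the abstract's speedup figures to the largest
# buildable monolithic GPU (= 1.0). Assumption: percentages compose
# multiplicatively across the same workload suite.

largest_monolithic = 1.0                        # baseline
optimized_mcm = 1.455 * largest_monolithic      # "45.5% faster than ... monolithic GPU"
basic_mcm = optimized_mcm / 1.228               # optimized is 22.8% over basic MCM-GPU
multi_gpu = optimized_mcm / 1.268               # optimized is 26.8% over the multi-GPU system
# "within 10% of a hypothetical monolithic GPU" -> rough upper bound
hypothetical_monolithic = optimized_mcm / 0.90

configs = [
    ("largest buildable monolithic GPU", largest_monolithic),
    ("multi-GPU (same SMs / DRAM BW)", multi_gpu),
    ("basic MCM-GPU", basic_mcm),
    ("optimized MCM-GPU", optimized_mcm),
    ("hypothetical monolithic GPU (bound)", hypothetical_monolithic),
]
for name, perf in configs:
    print(f"{name}: {perf:.3f}x")
```

Under this reading, the basic MCM-GPU and the multi-GPU system land at roughly 1.18x and 1.15x the largest buildable monolithic GPU, with the optimized MCM-GPU at 1.455x, close behind the unbuildable monolithic bound.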
MCM-GPU: Multi-Chip-Module GPUs for Continued Performance Scalability