Abstract
Conventional homogeneous multicore processors can no longer deliver the continued performance and energy-efficiency improvements we have come to expect. Heterogeneous architectures that feature specialized hardware accelerators are widely considered a promising paradigm for resolving this issue. Among heterogeneous devices, FPGAs, which can be reconfigured to accelerate a broad class of applications with orders-of-magnitude performance/watt gains, are attracting increased attention from both academia and industry. As a consequence, industry vendors now supply a variety of CPU-FPGA acceleration platforms with diverse microarchitectural features. This diversity, however, poses a serious challenge to application developers in selecting the appropriate platform for a specific application or application domain.
This article aims to address this challenge by determining which microarchitectural characteristics affect performance, and in what ways. Specifically, we conduct a quantitative comparison and an in-depth analysis on five state-of-the-art CPU-FPGA acceleration platforms: (1) the Alpha Data board and (2) the Amazon F1 instance that represent the traditional PCIe-based platform with private device memory; (3) the IBM CAPI that represents the PCIe-based system with coherent shared memory; (4) the first generation of the Intel Xeon+FPGA Accelerator Platform that represents the QPI-based system with coherent shared memory; and (5) the second generation of the Intel Xeon+FPGA Accelerator Platform that represents a hybrid PCIe-based (non-coherent) and QPI-based (coherent) system with shared memory. Based on the analysis of their CPU-FPGA communication latency and bandwidth characteristics, we provide a series of insights for both application developers and platform designers. Furthermore, we conduct two case studies to demonstrate how these insights can be leveraged to optimize accelerator designs. The microbenchmarks used for evaluation have been released for public use.
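The released microbenchmarks themselves target real CPU-FPGA hardware, but the underlying methodology — time transfers of increasing payload sizes and derive effective bandwidth, so that the latency-dominated and bandwidth-dominated regimes become visible — can be sketched in a platform-independent way. The following is a minimal, hypothetical sketch of that sweep; the in-memory copy is a stand-in for a real host-to-device transfer (e.g., over PCIe or QPI), and all names here are illustrative, not from the authors' released code.

```python
# Hypothetical sketch of a communication-bandwidth microbenchmark sweep.
# An in-memory copy stands in for the CPU->FPGA transfer so the sweep
# logic is runnable anywhere; on real hardware the timed operation would
# be a host-to-device copy over PCIe/QPI.
import time

def effective_bandwidth(payload_bytes, repeats=50):
    """Return effective bandwidth (bytes/second) for one payload size."""
    src = bytearray(payload_bytes)
    start = time.perf_counter()
    for _ in range(repeats):
        dst = bytes(src)  # stand-in for the host-to-device transfer
    elapsed = time.perf_counter() - start
    return payload_bytes * repeats / elapsed

def sweep(sizes=(64, 1024, 64 * 1024, 4 * 1024 * 1024)):
    """Sweep payload sizes, as a latency/bandwidth microbenchmark would:
    small payloads expose per-transfer latency overhead, large payloads
    approach the peak sustainable bandwidth of the link."""
    return {size: effective_bandwidth(size) for size in sizes}

if __name__ == "__main__":
    for size, bw in sweep().items():
        print(f"{size:>10} B : {bw / 1e9:6.2f} GB/s")
```

Running the sweep shows the expected shape: tiny payloads yield poor effective bandwidth because fixed per-transfer overhead dominates, while multi-megabyte payloads amortize that overhead — the same qualitative behavior the article analyzes on each CPU-FPGA interconnect.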