Abstract
Conventional homogeneous multicore processors can no longer deliver the continued performance and energy-efficiency improvements we have come to expect. Heterogeneous architectures that feature specialized hardware accelerators are widely considered a promising paradigm for resolving this issue. Among heterogeneous devices, FPGAs, which can be reconfigured to accelerate a broad class of applications with orders-of-magnitude performance/watt gains, are attracting increased attention from both academia and industry. As a consequence, industry vendors now supply a variety of CPU-FPGA acceleration platforms with diverse microarchitectural features. This diversity, however, poses a serious challenge to application developers in selecting the appropriate platform for a specific application or application domain.
This article aims to address this challenge by determining which microarchitectural characteristics affect performance, and in what ways. Specifically, we conduct a quantitative comparison and an in-depth analysis on five state-of-the-art CPU-FPGA acceleration platforms: (1) the Alpha Data board and (2) the Amazon F1 instance that represent the traditional PCIe-based platform with private device memory; (3) the IBM CAPI that represents the PCIe-based system with coherent shared memory; (4) the first generation of the Intel Xeon+FPGA Accelerator Platform that represents the QPI-based system with coherent shared memory; and (5) the second generation of the Intel Xeon+FPGA Accelerator Platform that represents a hybrid PCIe-based (non-coherent) and QPI-based (coherent) system with shared memory. Based on the analysis of their CPU-FPGA communication latency and bandwidth characteristics, we provide a series of insights for both application developers and platform designers. Furthermore, we conduct two case studies to demonstrate how these insights can be leveraged to optimize accelerator designs. The microbenchmarks used for evaluation have been released for public use.
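The released microbenchmarks themselves target real CPU-FPGA hardware, but the underlying methodology — time transfers of increasing payload sizes and derive effective bandwidth, so that the latency-dominated and bandwidth-dominated regimes become visible — can be sketched in a platform-independent way. The following is a minimal, hypothetical sketch of that sweep; the in-memory copy is a stand-in for a real host-to-device transfer (e.g., over PCIe or QPI), and all names here are illustrative, not from the authors' released code.

```python
# Hypothetical sketch of a communication-bandwidth microbenchmark sweep.
# An in-memory copy stands in for the CPU->FPGA transfer so the sweep
# logic is runnable anywhere; on real hardware the timed operation would
# be a host-to-device copy over PCIe/QPI.
import time

def effective_bandwidth(payload_bytes, repeats=50):
    """Return effective bandwidth (bytes/second) for one payload size."""
    src = bytearray(payload_bytes)
    start = time.perf_counter()
    for _ in range(repeats):
        dst = bytes(src)  # stand-in for the host-to-device transfer
    elapsed = time.perf_counter() - start
    return payload_bytes * repeats / elapsed

def sweep(sizes=(64, 1024, 64 * 1024, 4 * 1024 * 1024)):
    """Sweep payload sizes, as a latency/bandwidth microbenchmark would:
    small payloads expose per-transfer latency overhead, large payloads
    approach the peak sustainable bandwidth of the link."""
    return {size: effective_bandwidth(size) for size in sizes}

if __name__ == "__main__":
    for size, bw in sweep().items():
        print(f"{size:>10} B : {bw / 1e9:6.2f} GB/s")
```

Running the sweep shows the expected shape: tiny payloads yield poor effective bandwidth because fixed per-transfer overhead dominates, while multi-megabyte payloads amortize that overhead — the same qualitative behavior the article analyzes on each CPU-FPGA interconnect.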