
In-Depth Analysis on Microarchitectures of Modern Heterogeneous CPU-FPGA Platforms

Published: 17 February 2019

Abstract

Conventional homogeneous multicore processors can no longer provide the continued performance and energy-efficiency improvements that past endeavors led us to expect. Heterogeneous architectures that feature specialized hardware accelerators are widely considered a promising paradigm for resolving this issue. Among heterogeneous devices, FPGAs, which can be reconfigured to accelerate a broad class of applications with orders-of-magnitude performance/watt gains, are attracting increased attention from both academia and industry. As a consequence, industry vendors now supply a variety of CPU-FPGA acceleration platforms with diverse microarchitectural features. Such diversity, however, makes it difficult for application developers to select the appropriate platform for a specific application or application domain.

This article aims to address this challenge by determining which microarchitectural characteristics affect performance, and in what ways. Specifically, we conduct a quantitative comparison and an in-depth analysis of five state-of-the-art CPU-FPGA acceleration platforms: (1) the Alpha Data board and (2) the Amazon F1 instance, which represent the traditional PCIe-based platform with private device memory; (3) the IBM CAPI, which represents the PCIe-based system with coherent shared memory; (4) the first generation of the Intel Xeon+FPGA Accelerator Platform, which represents the QPI-based system with coherent shared memory; and (5) the second generation of the Intel Xeon+FPGA Accelerator Platform, which represents a hybrid PCIe-based (non-coherent) and QPI-based (coherent) system with shared memory. Based on the analysis of their CPU-FPGA communication latency and bandwidth characteristics, we provide a series of insights for both application developers and platform designers. Furthermore, we conduct two case studies to demonstrate how these insights can be leveraged to optimize accelerator designs. The microbenchmarks used for evaluation have been released for public use.
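To illustrate why communication latency and bandwidth have to be considered together, the following minimal sketch uses a generic latency-plus-streaming model of a single CPU-FPGA transfer. This model is an assumption made here for illustration and is not taken from the article; the function names and the numbers in the usage note are hypothetical and do not describe any of the five platforms.

```python
def effective_bandwidth(payload_bytes, latency_s, peak_bw_bytes_per_s):
    """Effective bandwidth of one transfer (bytes/s).

    Assumed model (illustrative only, not the article's):
        transfer_time = fixed latency + payload / peak bandwidth
    """
    if payload_bytes <= 0 or peak_bw_bytes_per_s <= 0 or latency_s < 0:
        raise ValueError("payload and peak bandwidth must be positive, latency non-negative")
    transfer_time = latency_s + payload_bytes / peak_bw_bytes_per_s
    return payload_bytes / transfer_time


def half_peak_payload(latency_s, peak_bw_bytes_per_s):
    """Payload size (bytes) at which the model reaches half of peak bandwidth.

    Setting effective_bandwidth = peak / 2 in the model above gives
    payload = latency * peak.
    """
    return latency_s * peak_bw_bytes_per_s
```

For example, with a hypothetical 1 microsecond latency and a 1 GB/s peak, `half_peak_payload(1e-6, 1e9)` is about 1000 bytes: under this model, transfers much smaller than that are dominated by latency, while much larger ones approach the peak bandwidth.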

    Published in

      ACM Transactions on Reconfigurable Technology and Systems, Volume 12, Issue 1
      March 2019
      115 pages
      ISSN: 1936-7406
      EISSN: 1936-7414
      DOI: 10.1145/3310278
      Editor: Deming Chen

      Copyright © 2019 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 17 February 2019
      • Accepted: 1 November 2018
      • Revised: 1 July 2018
      • Received: 1 March 2018

      Qualifiers

      • research-article
      • Research
      • Refereed
