ABSTRACT
As chips grow more computationally intensive, the mismatch between the shapes of computation layers and the available computation resources significantly limits chip utilization. Driven by this observation, prior works propose spatial accelerators or dataflow architectures to maximize throughput. However, using spatial accelerators can increase execution latency. In this work, we first systematically investigate two execution models: (1) sequentially (temporally) launching one monolithic accelerator, and (2) spatially launching multiple accelerators. From these observations, we find a latency-throughput tradeoff between the two execution models, and that combining the two strategies yields a more efficient latency-throughput Pareto front. To achieve this, we propose the spatial sequential architecture (SSR) and an SSR design automation framework that explores both strategies together when deploying deep learning inference. We use the 7nm AMD Versal ACAP VCK190 board to implement SSR accelerators for four end-to-end transformer-based deep learning models. SSR achieves average throughput gains of 2.53x, 35.71x, and 14.20x under different batch sizes compared to the 8nm Nvidia A10G GPU and the 16nm AMD ZCU102 and U250 FPGAs, respectively. The average energy efficiency gains are 8.51x, 6.75x, and 21.22x, respectively. Compared with the sequential-only and spatial-only solutions on VCK190, our spatial-sequential-hybrid solutions achieve higher throughput under the same latency requirement and lower latency under the same throughput requirement. We also use SSR analytical models to demonstrate how to optimize solutions on other computing platforms, e.g., the 14nm Intel Stratix 10 NX.
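The tradeoff the abstract describes can be sketched with a toy analytical model: a fully sequential design runs every layer on one monolithic accelerator (low pipeline latency, but utilization suffers from layer-shape mismatch), a fully spatial design dedicates one accelerator per layer (high steady-state throughput, but longer pipeline latency), and hybrid groupings fill in the Pareto front between them. The processing-element count, layer operation counts, and the utilization penalty per extra layer shape below are all illustrative assumptions, not the paper's actual SSR analytical model.

```python
# Toy sketch of the latency-throughput tradeoff between sequential,
# spatial, and hybrid accelerator mappings. All constants are assumptions.
from itertools import combinations

TOTAL_PES = 1024                    # hypothetical pool of compute units
LAYER_OPS = [8e9, 1e9, 8e9, 1e9]    # hypothetical per-layer op counts

def utilization(num_layer_shapes):
    # Assumption: the more distinct layer shapes one accelerator must
    # serve, the worse its average utilization (the shape-mismatch issue).
    return 1.0 / (1.0 + 0.4 * (num_layer_shapes - 1))

def evaluate(groups, pes=TOTAL_PES):
    """Pipeline of one accelerator per group; layers inside a group run
    temporally on that accelerator. Returns (latency, throughput)."""
    total = sum(sum(g) for g in groups)
    stage_times = []
    for g in groups:
        share = pes * sum(g) / total       # resources proportional to work
        stage_times.append(sum(g) / (share * utilization(len(g))))
    # Latency = time for one input through all stages;
    # steady-state throughput is limited by the slowest stage.
    return sum(stage_times), 1.0 / max(stage_times)

def contiguous_partitions(ops):
    """All ways to cut the layer sequence into contiguous groups."""
    n = len(ops)
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = [0, *cuts, n]
            yield [ops[a:b] for a, b in zip(bounds, bounds[1:])]

def pareto(points):
    """Keep points not dominated in (lower latency, higher throughput)."""
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] >= p[1] and q != p
                       for q in points)]

points = [evaluate(g) for g in contiguous_partitions(LAYER_OPS)]
front = pareto(points)
```

Under these assumptions, the one-group (sequential) design sits at the low-latency end of `front`, the one-layer-per-group (spatial) design at the high-throughput end, and a hybrid two-accelerator grouping lands between them, which mirrors the spatial-sequential-hybrid argument in the abstract.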
Index Terms: SSR: Spatial Sequential Hybrid Architecture for Latency Throughput Tradeoff in Transformer Acceleration