DOI: 10.1145/3626202.3637569
research-article
Open Access

SSR: Spatial Sequential Hybrid Architecture for Latency Throughput Tradeoff in Transformer Acceleration

Published: 02 April 2024

ABSTRACT

As chips grow more computationally intensive, the mismatch between the shapes of computation layers and the available compute resources significantly limits chip utilization. Driven by this observation, prior works propose spatial accelerators, or dataflow architectures, to maximize throughput. However, spatial accelerators can increase execution latency. In this work, we first systematically investigate two execution models: (1) sequentially (temporally) launching one monolithic accelerator, and (2) spatially launching multiple accelerators. We find that there is a latency-throughput tradeoff between these two execution models, and that combining the two strategies yields a more efficient latency-throughput Pareto front. To achieve this, we propose the spatial sequential architecture (SSR) and the SSR design automation framework, which explore both strategies together when deploying deep learning inference. We implement SSR accelerators for four end-to-end transformer-based deep learning models on the 7nm AMD Versal ACAP VCK190 board. SSR achieves average throughput gains of 2.53x, 35.71x, and 14.20x under different batch sizes compared to the 8nm Nvidia A10G GPU and the 16nm AMD ZCU102 and U250 FPGAs; the average energy efficiency gains are 8.51x, 6.75x, and 21.22x, respectively. Compared with the sequential-only and spatial-only solutions on the VCK190, our spatial-sequential-hybrid solutions achieve higher throughput under the same latency requirement and lower latency under the same throughput requirement. We also use the SSR analytical models to demonstrate how SSR can optimize solutions on other computing platforms, e.g., the 14nm Intel Stratix 10 NX.
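The latency-throughput tradeoff between the two execution models can be illustrated with a toy analytical model. This is a hypothetical sketch under simplified assumptions (per-layer work amounts, a fixed resource budget, and utilization factors are all made up for illustration), not the paper's actual SSR analytical model:

```python
# Toy model (hypothetical): compare a sequential (monolithic) accelerator
# against a spatial (pipelined, multi-accelerator) design.

def sequential(work, total_res, util):
    """One monolithic accelerator runs layers back to back.
    Each layer gets all resources, but layer-shape mismatch caps
    effective utilization. No pipelining: throughput = 1 / latency."""
    latency = sum(w / (total_res * u) for w, u in zip(work, util))
    return latency, 1.0 / latency

def spatial(work, res_split):
    """Each layer gets its own smaller accelerator; samples stream
    through the pipeline. Steady-state throughput is set by the
    slowest stage; a single sample's latency is the sum of stages."""
    stage_times = [w / r for w, r in zip(work, res_split)]
    return sum(stage_times), 1.0 / max(stage_times)

work = [8.0, 2.0, 6.0]    # per-layer work (arbitrary units)
R = 16.0                  # total compute resources
util = [1.0, 0.25, 0.75]  # monolithic utilization per layer (shape mismatch)
split = [8.0, 2.0, 6.0]   # spatial partition, proportional to work

seq_lat, seq_tput = sequential(work, R, util)
spa_lat, spa_tput = spatial(work, split)
# Spatial wins on throughput (pipelining, near-perfect utilization)
# but loses on single-sample latency (each stage is smaller than R).
```

With these made-up numbers, the spatial design delivers higher throughput but higher per-sample latency than the monolithic design, which is exactly the tradeoff a hybrid spatial-sequential mapping aims to navigate.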


      • Published in

        FPGA '24: Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays
        April 2024
        300 pages
        ISBN:9798400704185
        DOI:10.1145/3626202

        Copyright © 2024 Owner/Author

        This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike International 4.0 License.

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Acceptance Rates

Overall acceptance rate: 125 of 627 submissions, 20%