DOI: 10.1145/3453688.3461739

Accommodating Transformer onto FPGA: Coupling the Balanced Model Compression and FPGA-Implementation Optimization

Published: 22 June 2021

ABSTRACT

Transformers have recently gained popularity and achieve outstanding performance on many Natural Language Processing (NLP) tasks. However, they suffer from heavy computation and a large memory footprint, which makes them difficult to deploy on embedded devices. The field-programmable gate array (FPGA) is widely used to accelerate deep learning algorithms thanks to advantages such as reconfigurability and energy efficiency, yet trained Transformer models are too large to fit onto an FPGA fabric. To accommodate Transformers on FPGAs and achieve efficient execution, we propose an acceleration framework that couples balanced model compression at the algorithm level with FPGA-implementation optimization at the hardware level. At the algorithm level, we adopt block-balanced pruning and propose an efficient sparse-matrix storage format tailored to this pruning technique, named Compressed Block Row (CBR). At the hardware level, we design an accelerator for the sparse model and abstract a performance analytic model to evaluate its performance. Experiments show that the CBR format outperforms general-purpose formats and significantly reduces storage space, and that our accelerator achieves 38× and 1.93× speedups over prior works on CPU and GPU, respectively.
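The abstract describes CBR only at a high level. As an illustrative sketch (the block size, the tile-selection criterion, and all function names below are our assumptions, not details from the paper), the snippet shows why a balanced block-sparse pattern admits a compact encoding: when every block row retains the same number of nonzero tiles, a flat array of block-column indices plus a flat array of dense tiles suffices, and the CSR-style row-pointer array can be dropped entirely.

```python
import numpy as np

def prune_block_balanced(W, block=4, keep=2):
    # Split W into (block x block) tiles; within each block row, keep only
    # the `keep` tiles with the largest Frobenius norm and zero the rest.
    # Every block row retains the same number of tiles -- the "balanced"
    # property that a CBR-style format can exploit.
    br, bc = W.shape[0] // block, W.shape[1] // block
    mask = np.zeros((br, bc), dtype=bool)
    Wp = np.zeros_like(W)
    for i in range(br):
        row_band = W[i * block:(i + 1) * block]
        norms = [np.linalg.norm(row_band[:, j * block:(j + 1) * block])
                 for j in range(bc)]
        for j in np.argsort(norms)[-keep:]:
            mask[i, j] = True
            Wp[i * block:(i + 1) * block, j * block:(j + 1) * block] = \
                row_band[:, j * block:(j + 1) * block]
    return Wp, mask

def to_cbr(Wp, mask, block=4):
    # Hypothetical CBR encoding: one flat array of block-column indices and
    # one flat array of dense tiles. Because every block row holds the same
    # number of tiles k, block row i owns entries [i * k, (i + 1) * k) --
    # no per-row pointer array is stored.
    col_idx, tiles = [], []
    for i, j in zip(*np.nonzero(mask)):  # row-major order groups block rows
        col_idx.append(j)
        tiles.append(Wp[i * block:(i + 1) * block, j * block:(j + 1) * block])
    return np.asarray(col_idx), np.stack(tiles)
```

A block-sparse matrix-vector product can then stream the tiles in order and use `col_idx` to gather the matching slice of the input vector; the fixed per-row workload is also what keeps the load balanced across parallel processing elements on an FPGA.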

Supplemental Material

GLSVLSI21-vlsi12s.mp4 (mp4, 85.7 MB)


Published in

GLSVLSI '21: Proceedings of the 2021 on Great Lakes Symposium on VLSI
June 2021, 504 pages
ISBN: 9781450383936
DOI: 10.1145/3453688

      Copyright © 2021 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Qualifiers

      • research-article

Acceptance Rates

Overall Acceptance Rate: 312 of 1,156 submissions, 27%

