ABSTRACT
Transformers have recently gained popularity and achieve outstanding performance on many Natural Language Processing (NLP) tasks. However, Transformers suffer from heavy computation and a large memory footprint, which makes them difficult to deploy on embedded devices. Field-programmable gate arrays (FPGAs) are widely used to accelerate deep learning algorithms because of their reconfigurability and energy efficiency. However, trained Transformer models are too large to fit onto an FPGA fabric. To accommodate Transformers on FPGAs and achieve efficient execution, we propose an acceleration framework that couples balanced model compression at the algorithm level with FPGA-implementation optimization at the hardware level. At the algorithm level, we adopt block-balanced pruning and propose an efficient sparse matrix storage format tailored to this pruning technique, named Compressed Block Row (CBR). At the hardware level, we design an accelerator for the resulting sparse models and abstract a performance analytic model to evaluate the accelerator's performance. Experiments show that our CBR format outperforms general-purpose formats and significantly reduces storage space, and that our accelerator achieves $38\times$ and $1.93\times$ speedups over prior CPU and GPU implementations, respectively.
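The exact layout of the CBR format is defined in the paper body, not in this abstract. As a rough illustration only, the general idea of a block-row compressed layout (storing nonzero blocks with CSR-style row pointers over block rows, much like SciPy's BSR format; the authors' actual CBR layout may differ) can be sketched as:

```python
import numpy as np

def to_block_row_format(dense, block):
    """Pack a block-sparse matrix into a CSR-like layout over blocks:
    `values` holds the nonzero blocks in block-row order, `col_idx`
    holds each stored block's block-column index, and `row_ptr` holds
    the start offset of each block row (as in CSR row pointers)."""
    rows, cols = dense.shape
    assert rows % block == 0 and cols % block == 0
    values, col_idx, row_ptr = [], [], [0]
    for bi in range(rows // block):
        for bj in range(cols // block):
            blk = dense[bi*block:(bi+1)*block, bj*block:(bj+1)*block]
            if np.any(blk):            # store only nonzero blocks
                values.append(blk)
                col_idx.append(bj)
        row_ptr.append(len(col_idx))   # close this block row
    return values, col_idx, row_ptr

# Example: a 4x4 matrix pruned to two nonzero 2x2 blocks,
# one per block row (a balanced pattern).
m = np.zeros((4, 4))
m[0:2, 2:4] = 1.0   # nonzero block at block position (0, 1)
m[2:4, 0:2] = 2.0   # nonzero block at block position (1, 0)
vals, cols, ptr = to_block_row_format(m, 2)
# cols -> [1, 0], ptr -> [0, 1, 2]: one stored block per block row
```

With block-balanced pruning, every block row keeps the same number of nonzero blocks, so `row_ptr` increases by a constant stride and per-row storage is uniform, which is what makes the format hardware-friendly.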
Accommodating Transformer onto FPGA: Coupling the Balanced Model Compression and FPGA-Implementation Optimization