ABSTRACT
Transformers have recently gained popularity and achieve outstanding performance on many Natural Language Processing (NLP) tasks. However, Transformers suffer from heavy computation and a large memory footprint, which makes them difficult to deploy on embedded devices. Field-programmable gate arrays (FPGAs) are widely used to accelerate deep learning algorithms because of their reconfigurability and energy efficiency. However, trained Transformer models are too large to fit onto an FPGA fabric. To accommodate Transformers on FPGAs and achieve efficient execution, we propose an acceleration framework that couples balanced model compression at the algorithm level with FPGA-implementation optimization at the hardware level. At the algorithm level, we adopt block-balanced pruning and propose an efficient sparse matrix storage format tailored to this pruning technique, named Compressed Block Row (CBR). At the hardware level, we design an accelerator for the resulting sparse models and abstract a performance analytic model to evaluate the accelerator's performance. Experiments show that our CBR format outperforms general-purpose formats and significantly reduces storage space, and that our accelerator achieves $38\times$ and $1.93\times$ speedups over prior CPU and GPU implementations, respectively.
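The exact layout of the CBR format is defined in the paper body, not in this abstract. As a rough illustration only, the general idea of a block-row compressed layout (storing nonzero blocks with CSR-style row pointers over block rows, much like SciPy's BSR format; the authors' actual CBR layout may differ) can be sketched as:

```python
import numpy as np

def to_block_row_format(dense, block):
    """Pack a block-sparse matrix into a CSR-like layout over blocks:
    `values` holds the nonzero blocks in block-row order, `col_idx`
    holds each stored block's block-column index, and `row_ptr` holds
    the start offset of each block row (as in CSR row pointers)."""
    rows, cols = dense.shape
    assert rows % block == 0 and cols % block == 0
    values, col_idx, row_ptr = [], [], [0]
    for bi in range(rows // block):
        for bj in range(cols // block):
            blk = dense[bi*block:(bi+1)*block, bj*block:(bj+1)*block]
            if np.any(blk):            # store only nonzero blocks
                values.append(blk)
                col_idx.append(bj)
        row_ptr.append(len(col_idx))   # close this block row
    return values, col_idx, row_ptr

# Example: a 4x4 matrix pruned to two nonzero 2x2 blocks,
# one per block row (a balanced pattern).
m = np.zeros((4, 4))
m[0:2, 2:4] = 1.0   # nonzero block at block position (0, 1)
m[2:4, 0:2] = 2.0   # nonzero block at block position (1, 0)
vals, cols, ptr = to_block_row_format(m, 2)
# cols -> [1, 0], ptr -> [0, 1, 2]: one stored block per block row
```

With block-balanced pruning, every block row keeps the same number of nonzero blocks, so `row_ptr` increases by a constant stride and per-row storage is uniform, which is what makes the format hardware-friendly.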
Accommodating Transformer onto FPGA: Coupling the Balanced Model Compression and FPGA-Implementation Optimization