Abstract
State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps with the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations and dominates the required power.
Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy-efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE a 120× energy saving; exploiting sparsity saves 10×; weight sharing gives 8×; skipping zero activations from ReLU saves another 3×. Evaluated on nine DNN benchmarks, EIE is 189× and 13× faster than CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS working directly on a compressed network, corresponding to 3 TOPS on an uncompressed network, and processes FC layers of AlexNet at 1.88×10⁴ frames/sec with a power dissipation of only 600 mW. It is 24,000× and 3,400× more energy efficient than a CPU and GPU, respectively. Compared with DaDianNao, EIE has 2.9×, 19× and 3× better throughput, energy efficiency, and area efficiency.
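The core operation the abstract describes, sparse matrix-vector multiplication over a pruned, weight-shared matrix with zero activations skipped, can be illustrated in a few lines. Below is a minimal Python sketch of that arithmetic, not the EIE hardware pipeline: all names are ours, and the weights sit in a CSC-style layout with absolute row indices, whereas the paper's format uses relative (run-length) indices and interleaves rows across parallel processing elements.

```python
import numpy as np

def eie_style_matvec(codebook, weight_idx, row_idx, col_ptr, a, n_rows):
    """Compute b = ReLU(W @ a) for a pruned, weight-shared W in CSC form.

    codebook   -- the shared weight values (e.g. 16 entries for 4-bit indices)
    weight_idx -- codebook index of each retained (nonzero) weight
    row_idx    -- row position of each retained weight
    col_ptr    -- col_ptr[j]:col_ptr[j+1] spans column j's nonzeros
    """
    b = np.zeros(n_rows, dtype=np.float32)
    for j, aj in enumerate(a):
        if aj == 0.0:                 # skip zero activations produced by ReLU
            continue                  # (the 3x dynamic-sparsity saving)
        for k in range(col_ptr[j], col_ptr[j + 1]):
            # Decode the small index through the codebook (weight sharing),
            # touching only the weights that survived pruning.
            b[row_idx[k]] += codebook[weight_idx[k]] * aj
    return np.maximum(b, 0.0)         # ReLU feeding the next layer

# Toy 3x4 layer with four retained weights drawn from a 4-entry codebook.
codebook   = np.array([-0.5, 0.1, 0.7, 1.2], dtype=np.float32)
weight_idx = np.array([2, 0, 3, 1])     # 2-bit indices into the codebook
row_idx    = np.array([0, 2, 1, 2])
col_ptr    = np.array([0, 1, 2, 2, 4])  # column 2 was pruned away entirely
a          = np.array([1.0, 0.0, 3.0, 2.0], dtype=np.float32)
print(eie_style_matvec(codebook, weight_idx, row_idx, col_ptr, a, n_rows=3))
# -> [0.7 2.4 0.2]
```

Storing only a codebook index per retained weight is what lets the compressed model fit in on-chip SRAM, and iterating column-by-column is what makes skipping a zero activation free: the entire column of work simply never happens.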
References
- A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," arXiv:1409.4842, 2014.
- K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv:1409.1556, 2014.
- T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, "Recurrent neural network based language model," in INTERSPEECH, 2010, pp. 1045--1048.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278--2324, 1998.
- Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "DeepFace: Closing the gap to human-level performance in face verification," in CVPR, 2014, pp. 1701--1708.
- A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," arXiv:1412.2306, 2014.
- A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, and N. Andrew, "Deep learning with COTS HPC systems," in ICML, 2013.
- M. Horowitz, "Energy table for 45nm process," Stanford VLSI wiki. [Online]. Available: https://sites.google.com/site/seecproject
- T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," in ASPLOS, 2014.
- Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A machine-learning supercomputer," in MICRO, 2014.
- Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "ShiDianNao: Shifting vision processing closer to the sensor," in ISCA, 2015, pp. 92--104.
- C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun, "CNP: An FPGA-based processor for convolutional networks," in FPL, 2009.
- J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, Y. Wang, and H. Yang, "Going deeper with embedded FPGA platform for convolutional neural network," in FPGA, 2016.
- A. Shafiee et al., "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in ISCA, 2016.
- S. Han, J. Pool, J. Tran, and W. J. Dally, "Learning both weights and connections for efficient neural networks," in NIPS, 2015.
- R. Girshick, "Fast R-CNN," arXiv:1504.08083, 2015.
- S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, 1997.
- A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, 2005.
- N. D. Lane and P. Georgiev, "Can deep learning revolutionize mobile sensing?" in International Workshop on Mobile Computing Systems and Applications, 2015, pp. 117--122.
- R. Dorrance, F. Ren, and D. Marković, "A scalable sparse matrix-vector multiplication kernel for energy-efficient sparse-BLAS on FPGAs," in FPGA, 2014.
- V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in ICML, 2010.
- S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in ICLR, 2016.
- R. W. Vuduc, "Automatic performance tuning of sparse matrix kernels," Ph.D. dissertation, UC Berkeley, 2003.
- N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, "CACTI 6.0: A tool to model large caches," HP Laboratories, pp. 22--31, 2009.
- NVIDIA, "Technical brief: NVIDIA Jetson TK1 development kit bringing GPU-accelerated computing to embedded systems."
- NVIDIA, "Whitepaper: GPU-based deep learning inference: A performance and power analysis."
- Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv:1408.5093, 2014.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.
- F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," arXiv:1602.07360, 2016.
- L. Zhuo and V. K. Prasanna, "Sparse matrix-vector multiplication on FPGAs," in FPGA, 2005.
- V. Eijkhout, "LAPACK working note 50: Distributed sparse data structures for linear algebra operations," 1992.
- A. Lavin, "Fast algorithms for convolutional neural networks," arXiv:1509.09308, 2015.
- S. J. Hanson and L. Y. Pratt, "Comparing biases for minimal network construction with back-propagation," in NIPS, 1989.
- Y. LeCun, J. S. Denker, S. A. Solla, R. E. Howard, and L. D. Jackel, "Optimal brain damage," in NIPS, 1989.
- B. Hassibi, D. G. Stork et al., "Second order derivatives for network pruning: Optimal brain surgeon," in NIPS, 1993, pp. 164--171.
- E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, "Exploiting linear structure within convolutional networks for efficient evaluation," in NIPS, 2014.
- X. Zhang, J. Zou, X. Ming, K. He, and J. Sun, "Efficient and accurate approximations of nonlinear convolutional networks," arXiv:1411.4229, 2014.
- B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G.-Y. Wei, and D. Brooks, "Minerva: Enabling low-power, highly-accurate deep neural network accelerators," in ISCA, 2016.
- S. K. Esser et al., "Convolutional networks for fast, energy-efficient neuromorphic computing," arXiv:1603.08270, 2016.
- C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in FPGA, 2015.
- A. Monakov, A. Lokhmotov, and A. Avetisyan, "Automatically tuning sparse matrix-vector multiplication for GPU architectures," in HiPEAC, 2010.
- N. Bell and M. Garland, "Efficient sparse matrix-vector multiplication on CUDA," NVIDIA Technical Report NVR-2008-004, 2008.
- N. Bell and M. Garland, "Implementing sparse matrix-vector multiplication on throughput-oriented processors," in SC, 2009.
- J. Fowers, K. Ovtcharov, K. Strauss, E. S. Chung, and G. Stitt, "A high memory bandwidth FPGA accelerator for sparse matrix-vector multiplication," in FCCM, 2014.