Abstract
State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps with the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations and dominates the required power.
Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy-efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE a 120× energy saving; exploiting sparsity saves 10×; weight sharing gives 8×; skipping zero activations from ReLU saves another 3×. Evaluated on nine DNN benchmarks, EIE is 189× and 13× faster than CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS working directly on a compressed network, corresponding to 3 TOPS on an uncompressed network, and processes FC layers of AlexNet at 1.88×10⁴ frames/sec with a power dissipation of only 600 mW. It is 24,000× and 3,400× more energy efficient than a CPU and GPU, respectively. Compared with DaDianNao, EIE has 2.9×, 19× and 3× better throughput, energy efficiency, and area efficiency.
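The core operation the abstract describes, sparse matrix-vector multiplication over a pruned, weight-shared matrix with zero activations skipped, can be illustrated in a few lines. Below is a minimal Python sketch of that arithmetic, not the EIE hardware pipeline: all names are ours, and the weights sit in a CSC-style layout with absolute row indices, whereas the paper's format uses relative (run-length) indices and interleaves rows across parallel processing elements.

```python
import numpy as np

def eie_style_matvec(codebook, weight_idx, row_idx, col_ptr, a, n_rows):
    """Compute b = ReLU(W @ a) for a pruned, weight-shared W in CSC form.

    codebook   -- the shared weight values (e.g. 16 entries for 4-bit indices)
    weight_idx -- codebook index of each retained (nonzero) weight
    row_idx    -- row position of each retained weight
    col_ptr    -- col_ptr[j]:col_ptr[j+1] spans column j's nonzeros
    """
    b = np.zeros(n_rows, dtype=np.float32)
    for j, aj in enumerate(a):
        if aj == 0.0:                 # skip zero activations produced by ReLU
            continue                  # (the 3x dynamic-sparsity saving)
        for k in range(col_ptr[j], col_ptr[j + 1]):
            # Decode the small index through the codebook (weight sharing),
            # touching only the weights that survived pruning.
            b[row_idx[k]] += codebook[weight_idx[k]] * aj
    return np.maximum(b, 0.0)         # ReLU feeding the next layer

# Toy 3x4 layer with four retained weights drawn from a 4-entry codebook.
codebook   = np.array([-0.5, 0.1, 0.7, 1.2], dtype=np.float32)
weight_idx = np.array([2, 0, 3, 1])     # 2-bit indices into the codebook
row_idx    = np.array([0, 2, 1, 2])
col_ptr    = np.array([0, 1, 2, 2, 4])  # column 2 was pruned away entirely
a          = np.array([1.0, 0.0, 3.0, 2.0], dtype=np.float32)
print(eie_style_matvec(codebook, weight_idx, row_idx, col_ptr, a, n_rows=3))
# -> [0.7 2.4 0.2]
```

Storing only a codebook index per retained weight is what lets the compressed model fit in on-chip SRAM, and iterating column-by-column is what makes skipping a zero activation free: the entire column of work simply never happens.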
References
- A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," arXiv:1409.4842, 2014.
- K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv:1409.1556, 2014.
- T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, "Recurrent neural network based language model," in INTERSPEECH, 2010, pp. 1045--1048.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278--2324, 1998.
- Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "DeepFace: Closing the gap to human-level performance in face verification," in CVPR, 2014, pp. 1701--1708.
- A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," arXiv:1412.2306, 2014.
- A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, and N. Andrew, "Deep learning with COTS HPC systems," in ICML, 2013.
- M. Horowitz, "Energy table for 45nm process," Stanford VLSI wiki. [Online]. Available: https://sites.google.com/site/seecproject
- T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," in ASPLOS, 2014.
- Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A machine-learning supercomputer," in MICRO, 2014.
- Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "ShiDianNao: Shifting vision processing closer to the sensor," in ISCA, 2015, pp. 92--104.
- C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun, "CNP: An FPGA-based processor for convolutional networks," in FPL, 2009.
- J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, Y. Wang, and H. Yang, "Going deeper with embedded FPGA platform for convolutional neural network," in FPGA, 2016.
- A. Shafiee et al., "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in ISCA, 2016.
- S. Han, J. Pool, J. Tran, and W. J. Dally, "Learning both weights and connections for efficient neural networks," in NIPS, 2015.
- R. Girshick, "Fast R-CNN," arXiv:1504.08083, 2015.
- S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, 1997.
- A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, 2005.
- N. D. Lane and P. Georgiev, "Can deep learning revolutionize mobile sensing?" in International Workshop on Mobile Computing Systems and Applications, 2015, pp. 117--122.
- R. Dorrance, F. Ren, and D. Marković, "A scalable sparse matrix-vector multiplication kernel for energy-efficient sparse-BLAS on FPGAs," in FPGA, 2014.
- V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in ICML, 2010.
- S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in ICLR, 2016.
- R. W. Vuduc, "Automatic performance tuning of sparse matrix kernels," Ph.D. dissertation, UC Berkeley, 2003.
- N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, "CACTI 6.0: A tool to model large caches," HP Laboratories, pp. 22--31, 2009.
- NVIDIA, "Technical brief: NVIDIA Jetson TK1 development kit bringing GPU-accelerated computing to embedded systems."
- NVIDIA, "Whitepaper: GPU-based deep learning inference: A performance and power analysis."
- Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv:1408.5093, 2014.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.
- F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," arXiv:1602.07360, 2016.
- L. Zhuo and V. K. Prasanna, "Sparse matrix-vector multiplication on FPGAs," in FPGA, 2005.
- V. Eijkhout, "LAPACK working note 50: Distributed sparse data structures for linear algebra operations," 1992.
- A. Lavin, "Fast algorithms for convolutional neural networks," arXiv:1509.09308, 2015.
- S. J. Hanson and L. Y. Pratt, "Comparing biases for minimal network construction with back-propagation," in NIPS, 1989.
- Y. LeCun, J. S. Denker, S. A. Solla, R. E. Howard, and L. D. Jackel, "Optimal brain damage," in NIPS, 1989.
- B. Hassibi, D. G. Stork et al., "Second order derivatives for network pruning: Optimal brain surgeon," in NIPS, 1993, pp. 164--171.
- E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, "Exploiting linear structure within convolutional networks for efficient evaluation," in NIPS, 2014.
- X. Zhang, J. Zou, X. Ming, K. He, and J. Sun, "Efficient and accurate approximations of nonlinear convolutional networks," arXiv:1411.4229, 2014.
- B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G.-Y. Wei, and D. Brooks, "Minerva: Enabling low-power, highly-accurate deep neural network accelerators," in ISCA, 2016.
- S. K. Esser et al., "Convolutional networks for fast, energy-efficient neuromorphic computing," arXiv:1603.08270, 2016.
- C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in FPGA, 2015.
- A. Monakov, A. Lokhmotov, and A. Avetisyan, "Automatically tuning sparse matrix-vector multiplication for GPU architectures," in HiPEAC, 2010.
- N. Bell and M. Garland, "Efficient sparse matrix-vector multiplication on CUDA," NVIDIA Technical Report NVR-2008-004, 2008.
- N. Bell and M. Garland, "Implementing sparse matrix-vector multiplication on throughput-oriented processors," in SC, 2009.
- J. Fowers, K. Ovtcharov, K. Strauss, E. S. Chung, and G. Stitt, "A high memory bandwidth FPGA accelerator for sparse matrix-vector multiplication," in FCCM, 2014.