research-article

EIE: efficient inference engine on compressed deep neural network

Published: 18 June 2016

Abstract

State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power.

Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE 120× energy saving; exploiting sparsity saves 10×; weight sharing gives 8×; skipping zero activations from ReLU saves another 3×. Evaluated on nine DNN benchmarks, EIE is 189× and 13× faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS working directly on a compressed network, corresponding to 3 TOPS on an uncompressed network, and processes FC layers of AlexNet at 1.88×10⁴ frames/sec with a power dissipation of only 600 mW. It is 24,000× and 3,400× more energy efficient than a CPU and GPU respectively. Compared with DaDianNao, EIE has 2.9×, 19×, and 3× better throughput, energy efficiency, and area efficiency, respectively.
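The computation the abstract describes is the fully-connected-layer product y = Wx, where W has been pruned (so only nonzero weights are stored) and weight-shared (each stored weight is a small codebook index), and where zero input activations from ReLU are skipped. Below is a minimal software sketch of that arithmetic; the column-wise data layout, function name, and codebook values are illustrative assumptions, not the paper's actual CSC encoding or hardware datapath.

    import numpy as np

    def compressed_matvec(n_rows, columns, codebook, x):
        # y = W @ x for a pruned, weight-shared matrix W.
        # columns[j] holds (row, idx) pairs for the nonzero entries of
        # column j; idx is a small codebook index (weight sharing), and
        # codebook[idx] is the shared weight value it decodes to.
        # Columns whose activation x[j] is zero (common after ReLU) are
        # skipped entirely, so no multiply-accumulate work is done for them.
        y = np.zeros(n_rows)
        for j, a in enumerate(x):
            if a == 0.0:                      # activation sparsity: skip zero inputs
                continue
            for row, idx in columns[j]:       # weight sparsity: only nonzeros stored
                y[row] += codebook[idx] * a   # decode shared weight, accumulate
        return y

    # Tiny illustrative example (hypothetical 2-bit codebook and weights).
    codebook = np.array([0.0, -0.5, 0.25, 1.0])
    columns = [
        [(0, 3), (2, 1)],   # column 0: W[0,0] = 1.0, W[2,0] = -0.5
        [],                 # column 1: fully pruned
        [(1, 2)],           # column 2: W[1,2] = 0.25
    ]
    x = np.array([2.0, 5.0, 0.0])             # x[2] = 0 is skipped
    print(compressed_matvec(3, columns, codebook, x))   # -> [ 2.  0. -1.]

In this software form the savings come from never touching pruned weights or zero-activation columns and from storing small indices instead of full-precision weights; EIE realizes the same idea in dedicated hardware with the weights held in on-chip SRAM.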

References

  1. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012.
  2. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," arXiv:1409.4842, 2014.
  3. K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv:1409.1556, 2014.
  4. T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, "Recurrent neural network based language model," in INTERSPEECH, 2010, pp. 1045--1048.
  5. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278--2324, 1998.
  6. Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "DeepFace: Closing the gap to human-level performance in face verification," in CVPR. IEEE, 2014, pp. 1701--1708.
  7. A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," arXiv:1412.2306, 2014.
  8. A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, and N. Andrew, "Deep learning with COTS HPC systems," in 30th ICML, 2013.
  9. M. Horowitz, "Energy table for 45nm process," Stanford VLSI wiki. [Online]. Available: https://sites.google.com/site/seecproject
  10. T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning," in ASPLOS, 2014.
  11. Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A machine-learning supercomputer," in MICRO, December 2014.
  12. Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "ShiDianNao: shifting vision processing closer to the sensor," in ISCA. ACM, 2015, pp. 92--104.
  13. C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun, "CNP: An FPGA-based processor for convolutional networks," in FPL, 2009.
  14. J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, Y. Wang, and H. Yang, "Going deeper with embedded FPGA platform for convolutional neural network," in FPGA, 2016.
  15. A. Shafiee et al., "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in ISCA, 2016.
  16. S. Han, J. Pool, J. Tran, and W. J. Dally, "Learning both weights and connections for efficient neural networks," in NIPS, 2015.
  17. R. Girshick, "Fast R-CNN," arXiv:1504.08083, 2015.
  18. S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, 1997.
  19. A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, 2005.
  20. N. D. Lane and P. Georgiev, "Can deep learning revolutionize mobile sensing?" in International Workshop on Mobile Computing Systems and Applications. ACM, 2015, pp. 117--122.
  21. R. Dorrance, F. Ren, and D. Marković, "A scalable sparse matrix-vector multiplication kernel for energy-efficient sparse-BLAS on FPGAs," in FPGA, 2014.
  22. V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in ICML, 2010.
  23. S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in International Conference on Learning Representations, 2016.
  24. R. W. Vuduc, "Automatic performance tuning of sparse matrix kernels," Ph.D. dissertation, UC Berkeley, 2003.
  25. N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, "CACTI 6.0: A tool to model large caches," HP Laboratories, pp. 22--31, 2009.
  26. NVIDIA, "Technical brief: NVIDIA Jetson TK1 development kit bringing GPU-accelerated computing to embedded systems."
  27. NVIDIA, "Whitepaper: GPU-based deep learning inference: A performance and power analysis."
  28. Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv:1408.5093, 2014.
  29. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Computer Vision and Pattern Recognition, 2009.
  30. F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," arXiv:1602.07360, 2016.
  31. L. Zhuo and V. K. Prasanna, "Sparse matrix-vector multiplication on FPGAs," in FPGA, 2005.
  32. V. Eijkhout, "LAPACK working note 50: Distributed sparse data structures for linear algebra operations," 1992.
  33. A. Lavin, "Fast algorithms for convolutional neural networks," arXiv:1509.09308, 2015.
  34. S. J. Hanson and L. Y. Pratt, "Comparing biases for minimal network construction with back-propagation," in NIPS, 1989.
  35. Y. LeCun, J. S. Denker, S. A. Solla, R. E. Howard, and L. D. Jackel, "Optimal brain damage," in NIPS, vol. 89, 1989.
  36. B. Hassibi, D. G. Stork et al., "Second order derivatives for network pruning: Optimal brain surgeon," Advances in Neural Information Processing Systems, pp. 164--164, 1993.
  37. E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, "Exploiting linear structure within convolutional networks for efficient evaluation," in NIPS, 2014.
  38. X. Zhang, J. Zou, X. Ming, K. He, and J. Sun, "Efficient and accurate approximations of nonlinear convolutional networks," arXiv:1411.4229, 2014.
  39. B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G.-Y. Wei, and D. Brooks, "Minerva: Enabling low-power, highly-accurate deep neural network accelerators," in ISCA, 2016.
  40. S. K. Esser et al., "Convolutional networks for fast, energy-efficient neuromorphic computing," arXiv:1603.08270, 2016.
  41. C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in FPGA, 2015.
  42. A. Monakov, A. Lokhmotov, and A. Avetisyan, "Automatically tuning sparse matrix-vector multiplication for GPU architectures," in HiPEAC, 2010.
  43. N. Bell and M. Garland, "Efficient sparse matrix-vector multiplication on CUDA," NVIDIA Technical Report NVR-2008-004, 2008.
  44. N. Bell and M. Garland, "Implementing sparse matrix-vector multiplication on throughput-oriented processors," in High Performance Computing Networking, Storage and Analysis, 2009.
  45. J. Fowers, K. Ovtcharov, K. Strauss, E. S. Chung, and G. Stitt, "A high memory bandwidth FPGA accelerator for sparse matrix-vector multiplication," in FCCM, 2014.

  • Published in

    ACM SIGARCH Computer Architecture News, Volume 44, Issue 3 (ISCA'16)
    June 2016, 730 pages
    ISSN: 0163-5964
    DOI: 10.1145/3007787

    ISCA '16: Proceedings of the 43rd International Symposium on Computer Architecture
    June 2016, 756 pages
    ISBN: 9781467389471

    Copyright © 2016 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States
