Abstract
Model compression has emerged as an important area of research for deploying deep learning models on Internet-of-Things (IoT) devices. However, in extremely memory-constrained scenarios, even the compressed models cannot fit within the memory of a single device and, as a result, must be distributed across multiple devices. This leads to a distributed inference paradigm in which memory and communication costs represent a major bottleneck. Yet, existing model compression techniques are not communication-aware. Therefore, we propose Network of Neural Networks (NoNN), a new distributed IoT learning paradigm that compresses a large pretrained ‘teacher’ deep network into several disjoint and highly-compressed ‘student’ modules, without loss of accuracy. Moreover, we propose a network science-based knowledge partitioning algorithm for the teacher model, and then train individual students on the resulting disjoint partitions. Extensive experimentation on five image classification datasets, for user-defined memory/performance budgets, shows that NoNN achieves higher accuracy than several baselines and accuracy similar to that of the teacher model, while using minimal communication among students. Finally, as a case study, we deploy the proposed model for the CIFAR-10 dataset on edge devices and demonstrate significant improvements in memory footprint (up to 24×), performance (up to 12×), and energy per node (up to 14×) compared to the large teacher model. We further show that, for distributed inference on multiple edge devices, our proposed NoNN model results in up to 33× reduction in total latency w.r.t. a state-of-the-art model compression baseline.
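The knowledge-partitioning step described above can be illustrated with a short sketch. This is not the authors' implementation: it assumes per-sample activations of the teacher's final convolutional layer are available as a NumPy array, and it uses networkx's greedy modularity communities as a stand-in for the paper's network science-based partitioning; the function name `partition_teacher_filters` and the activation-thresholding rule are illustrative assumptions.

```python
# Minimal sketch of partitioning a teacher's final-layer filters into disjoint
# groups, one group per student network. NOT the authors' code; modularity-based
# community detection (networkx) is used as a stand-in for the paper's method.
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities


def partition_teacher_filters(activations, num_students):
    """Split the teacher's final-layer filters into `num_students` disjoint groups.

    activations: (num_samples, num_filters) array of per-sample filter activations.
    Returns: list of `num_students` lists of filter indices.
    """
    num_filters = activations.shape[1]
    # A filter "fires" on a sample if it exceeds its own mean activation (assumed rule).
    fires = activations > activations.mean(axis=0, keepdims=True)
    # Pairwise co-activation counts serve as edge weights of the filter graph.
    co_activation = fires.T.astype(float) @ fires.astype(float)

    graph = nx.Graph()
    graph.add_nodes_from(range(num_filters))
    for i in range(num_filters):
        for j in range(i + 1, num_filters):
            if co_activation[i, j] > 0:
                graph.add_edge(i, j, weight=float(co_activation[i, j]))

    # Community detection may return more groups than students, so communities
    # are assigned to students round-robin to keep the partitions disjoint.
    communities = greedy_modularity_communities(graph, weight="weight")
    partitions = [[] for _ in range(num_students)]
    for idx, community in enumerate(communities):
        partitions[idx % num_students].extend(sorted(community))
    return partitions


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_activations = rng.random((512, 64))  # e.g., 64 final-layer filters
    for s, part in enumerate(partition_teacher_filters(fake_activations, 2)):
        print(f"student {s}: {len(part)} filters")
```

Each resulting partition would then serve as the distillation target for one student module, so that the students can run on separate devices and only exchange their (small) outputs at the final fusion step.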