Abstract
Model compression has emerged as an important area of research for deploying deep learning models on Internet-of-Things (IoT) devices. However, in extremely memory-constrained scenarios, even the compressed models cannot fit within the memory of a single device and, as a result, must be distributed across multiple devices. This leads to a distributed inference paradigm in which memory and communication costs represent a major bottleneck. Yet, existing model compression techniques are not communication-aware. Therefore, we propose Network of Neural Networks (NoNN), a new distributed IoT learning paradigm that compresses a large pretrained ‘teacher’ deep network into several disjoint and highly-compressed ‘student’ modules, without loss of accuracy. Moreover, we propose a network science-based knowledge partitioning algorithm for the teacher model, and then train individual students on the resulting disjoint partitions. Extensive experimentation on five image classification datasets, for user-defined memory/performance budgets, shows that NoNN achieves higher accuracy than several baselines and accuracy similar to that of the teacher model, while using minimal communication among students. Finally, as a case study, we deploy the proposed model for the CIFAR-10 dataset on edge devices and demonstrate significant improvements in memory footprint (up to 24×), performance (up to 12×), and energy per node (up to 14×) compared to the large teacher model. We further show that, for distributed inference on multiple edge devices, our proposed NoNN model results in up to 33× reduction in total latency w.r.t. a state-of-the-art model compression baseline.
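The knowledge-partitioning step described above can be illustrated with a short sketch. This is not the authors' implementation: it assumes per-sample activations of the teacher's final convolutional layer are available as a NumPy array, and it uses networkx's greedy modularity communities as a stand-in for the paper's network science-based partitioning; the function name `partition_teacher_filters` and the activation-thresholding rule are illustrative assumptions.

```python
# Minimal sketch of partitioning a teacher's final-layer filters into disjoint
# groups, one group per student network. NOT the authors' code; modularity-based
# community detection (networkx) is used as a stand-in for the paper's method.
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities


def partition_teacher_filters(activations, num_students):
    """Split the teacher's final-layer filters into `num_students` disjoint groups.

    activations: (num_samples, num_filters) array of per-sample filter activations.
    Returns: list of `num_students` lists of filter indices.
    """
    num_filters = activations.shape[1]
    # A filter "fires" on a sample if it exceeds its own mean activation (assumed rule).
    fires = activations > activations.mean(axis=0, keepdims=True)
    # Pairwise co-activation counts serve as edge weights of the filter graph.
    co_activation = fires.T.astype(float) @ fires.astype(float)

    graph = nx.Graph()
    graph.add_nodes_from(range(num_filters))
    for i in range(num_filters):
        for j in range(i + 1, num_filters):
            if co_activation[i, j] > 0:
                graph.add_edge(i, j, weight=float(co_activation[i, j]))

    # Community detection may return more groups than students, so communities
    # are assigned to students round-robin to keep the partitions disjoint.
    communities = greedy_modularity_communities(graph, weight="weight")
    partitions = [[] for _ in range(num_students)]
    for idx, community in enumerate(communities):
        partitions[idx % num_students].extend(sorted(community))
    return partitions


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_activations = rng.random((512, 64))  # e.g., 64 final-layer filters
    for s, part in enumerate(partition_teacher_filters(fake_activations, 2)):
        print(f"student {s}: {len(part)} filters")
```

Each resulting partition would then serve as the distillation target for one student module, so that the students can run on separate devices and only exchange their (small) outputs at the final fusion step.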