DOI: 10.1145/3485730.3485930

Mercury: Efficient On-Device Distributed DNN Training via Stochastic Importance Sampling

Published: 15 November 2021

ABSTRACT

As intelligence moves from data centers to the edge, intelligent edge devices such as smartphones, drones, robots, and smart IoT devices gain the capability to collectively train a deep learning model on the data they collect themselves. Despite its considerable value, the key bottleneck that keeps on-device distributed training from being practically useful in real-world deployments is the significant training time it consumes over wireless networks with constrained bandwidth. To tackle this critical bottleneck, we present Mercury, an importance sampling-based framework that enhances the training efficiency of on-device distributed training without compromising the accuracy of the trained models. The key idea behind the design of Mercury is to focus each training iteration on the samples that provide the most important information. Doing so improves the training efficiency of each iteration, so the total number of iterations can be considerably reduced and the overall training process sped up. We implemented Mercury and deployed it on a self-developed testbed. We demonstrate its effectiveness and show that Mercury consistently outperforms two status quo frameworks on six commonly used datasets across image classification, speech recognition, and natural language processing tasks.
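The key idea described above is a form of importance sampling: bias each mini-batch toward samples with high importance scores, then reweight the resulting gradient so it remains an unbiased estimate of the full-batch gradient. Below is a minimal NumPy sketch of that general technique; it is not Mercury's actual algorithm, and the toy logistic-regression model, the loss-proportional scores, the 0.9/0.1 uniform mixing, and all variable names are assumptions made purely for illustration.

    # Minimal sketch of loss-proportional importance sampling for SGD (NumPy only).
    # Illustrates the generic technique, not Mercury's distributed protocol.
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy binary-classification data and a linear (logistic-regression) model.
    N, D = 1024, 20
    X = rng.normal(size=(N, D))
    true_w = rng.normal(size=D)
    y = (X @ true_w + 0.5 * rng.normal(size=N) > 0).astype(float)

    w = np.zeros(D)

    def per_sample_loss(w):
        # Logistic loss for every sample; used here as the importance score.
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        eps = 1e-12
        return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

    batch, lr = 32, 0.1
    for step in range(200):
        losses = per_sample_loss(w)
        # Sampling distribution proportional to per-sample loss, mixed with a
        # uniform component so no sample ever has zero probability.
        probs = 0.9 * losses / losses.sum() + 0.1 / N
        idx = rng.choice(N, size=batch, replace=True, p=probs)

        # Importance weights 1/(N * p_i) keep the mini-batch gradient an
        # unbiased estimate of the full-batch gradient despite biased sampling.
        iw = 1.0 / (N * probs[idx])
        p = 1.0 / (1.0 + np.exp(-(X[idx] @ w)))
        grad = ((iw * (p - y[idx]))[:, None] * X[idx]).mean(axis=0)
        w -= lr * grad

    print("final mean loss:", per_sample_loss(w).mean())

The 1/(N·p_i) reweighting is what preserves unbiasedness while the sampler concentrates compute on informative samples; Mercury's own importance scores, sampling schedule, and on-device distributed coordination differ and are described in the full paper.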


Published in

      SenSys '21: Proceedings of the 19th ACM Conference on Embedded Networked Sensor Systems
      November 2021
      686 pages
ISBN: 978-1-4503-9097-2
DOI: 10.1145/3485730

      Copyright © 2021 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Qualifiers

      • research-article
      • Research
      • Refereed limited

      Acceptance Rates

SenSys '21 paper acceptance rate: 25 of 139 submissions (18%). Overall acceptance rate: 174 of 867 submissions (20%).
