
LBB: load-balanced batching for efficient distributed learning on heterogeneous GPU cluster

  • Published:

The Journal of Supercomputing

Abstract

As the cost of deep learning training increases, using heterogeneous GPU clusters is a reasonable way to scale cluster resources to support distributed deep learning (DDL) tasks. However, the commonly used synchronous stochastic gradient descent (SSGD) algorithm, based on the bulk synchronous parallel (BSP) model, suffers from stragglers in heterogeneous clusters, which significantly reduces training efficiency. To overcome this challenge, we propose load-balanced batching (LBB) to eliminate stragglers in DDL workloads. LBB first formulates the load-balancing problem and builds a performance model for each worker in the DDL workload by analyzing the relationship between DDL iteration time and the worker's local batch size. LBB then balances the workload across workers by coordinating their local batch sizes. In particular, LBB greatly mitigates static stragglers and severe dynamic stragglers by solving the load-balancing problem, and it eliminates residual stragglers by fine-tuning batch sizes during training. LBB is implemented in PyTorch, and extensive experiments are performed on a heterogeneous server equipped with four GPUs of three different models. The experimental results verify the effectiveness of LBB on standard benchmarks, demonstrating that LBB reduces training time by 64.57%, 59%, and 5.4% compared with SSGD, local SGD, and FlexRR, respectively, without sacrificing accuracy.
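
To make the batching idea concrete, the sketch below is a minimal illustration (not the authors' released LBB code) of how local batch sizes could be coordinated. It assumes a linear per-worker performance model, iteration_time ≈ a_i · batch_size_i + c_i, fits it from a few profiled timings, and solves for local batch sizes that equalize the predicted iteration time while keeping the global batch size fixed; the dynamic fine-tuning of batch sizes during training described in the abstract is omitted, and all function and variable names are hypothetical. The sketch uses NumPy only.

```python
import numpy as np

def fit_worker_models(profiled):
    """Fit a linear per-worker model t = a*b + c from profiled
    (batch_size, iteration_time) samples.

    `profiled` maps a worker id to a list of (batch_size, seconds) pairs."""
    models = {}
    for wid, samples in profiled.items():
        bs = np.array([s[0] for s in samples], dtype=float)
        ts = np.array([s[1] for s in samples], dtype=float)
        a, c = np.polyfit(bs, ts, 1)  # least-squares slope and intercept
        models[wid] = (a, c)
    return models

def balance_batch_sizes(models, global_batch):
    """Pick local batch sizes b_i so every worker's predicted iteration time
    a_i*b_i + c_i is (approximately) equal and sum(b_i) == global_batch.

    Equalizing the time to a common T gives b_i = (T - c_i)/a_i; summing over
    workers and solving for T yields the closed form used below."""
    workers = sorted(models)
    a = np.array([models[w][0] for w in workers])
    c = np.array([models[w][1] for w in workers])
    T = (global_batch + np.sum(c / a)) / np.sum(1.0 / a)
    sizes = np.maximum(1, np.floor((T - c) / a)).astype(int)
    # Hand out the rounding remainder one sample at a time, fastest workers first.
    order = np.argsort(a)
    i = 0
    while sizes.sum() < global_batch:
        sizes[order[i % len(order)]] += 1
        i += 1
    return {w: int(s) for w, s in zip(workers, sizes)}

# Example: two similar fast GPUs and one slower GPU sharing a global batch of 256.
profiled = {
    0: [(32, 0.040), (64, 0.072), (128, 0.136)],
    1: [(32, 0.041), (64, 0.074), (128, 0.140)],
    2: [(32, 0.080), (64, 0.150), (128, 0.290)],
}
models = fit_worker_models(profiled)
print(balance_batch_sizes(models, global_batch=256))
```

In a running system, the fitted coefficients would be refreshed from measured iteration times so that the assignment can also track dynamic stragglers, in the spirit of the fine-tuning step described above.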

Data availability

The public CIFAR10 and CIFAR100 datasets [23] used in this research are available at https://www.cs.toronto.edu/~kriz/cifar.html.

Code availability

The authors will release the LBB implementation for reproducibility once it has been organized. The code will be made available at https://github.com/FLYING37520/LBB.

References

  1. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, vol 30

  2. Jiang P, Ergu D, Liu F, Cai Y, Ma B (2022) A review of Yolo algorithm developments. Proc Comput Sci 199:1066–1073. https://doi.org/10.1016/j.procs.2022.01.135

  3. Saharia C, Chan W, Saxena S, Li L, Whang J, Denton E, Ghasemipour SKS, Ayan BK, Mahdavi SS, Lopes RG, Salimans T, Ho J, Fleet DJ, Norouzi M (2022) Photorealistic text-to-image diffusion models with deep language understanding. arXiv:2205.11487 [cs.CV]

  4. Ramesh A, Pavlov M, Goh G, Gray S, Voss C, Radford A, Chen M, Sutskever I (2021) Zero-shot text-to-image generation. In: Proceedings of the 38th International Conference on Machine Learning, vol 139, pp 8821–8831

  5. Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G, Sutskever I (2021) Learning transferable visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning, vol 139, pp 8748–8763. https://proceedings.mlr.press/v139/radford21a.html

  6. Krizhevsky A, Sutskever I, Hinton GE (2017) ImageNet classification with deep convolutional neural networks. Commun ACM 60(6):84–90. https://doi.org/10.1145/3065386

  7. Shoeybi M, Patwary M, Puri R, LeGresley P, Casper J, Catanzaro B (2020) Megatron-LM: training multi-billion parameter language models using model parallelism. arXiv:1909.08053 [cs.CV]

  8. Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S, Herbert-Voss A, Krueger G, Henighan T, Child R, Ramesh A, Ziegler D, Wu J, Winter C, Hesse C, Chen M, Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford A, Sutskever I, Amodei D (2020) Language models are few-shot learners. In: Advances in neural information processing systems, vol 33, pp 1877–1901

  9. Tang Z, Shi S, Chu X, Wang W, Li B (2020) Communication-efficient distributed deep learning: a comprehensive survey. arXiv:2003.06307 [cs.CV]

  10. Gan S, Jiang J, Yuan B, Zhang C, Lian X, Wang R, Chang J, Liu C, Shi H, Zhang S, Li X, Sun T, Yang S, Liu J (2021) Bagua: scaling up distributed learning with system relaxations. Proc VLDB Endow 15(4):804–813. https://doi.org/10.14778/3503585.3503590

  11. Jiang J, Cui B, Zhang C, Yu L (2017) Heterogeneity-aware distributed parameter servers. Association for Computing Machinery, New York, pp 463–478. https://doi.org/10.1145/3035918.3035933

  12. Narayanan D, Santhanam K, Kazhamiaka F, Phanishayee A, Zaharia M (2020) Heterogeneity-aware cluster scheduling policies for deep learning workloads. In: 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pp 481–498. https://www.usenix.org/conference/osdi20/presentation/narayanan-deepak

  13. Kim H, Song C, Lee H, Yu H (2023) Addressing straggler problem through dynamic partial all-reduce for distributed deep learning in heterogeneous GPU clusters. In: IEEE International Conference on Consumer Electronics (ICCE), pp 1–6. https://doi.org/10.1109/ICCE56470.2023.10043527

  14. Ho Q, Cipar J, Cui H, Lee S, Kim JK, Gibbons PB, Gibson GA, Ganger G, Xing EP (2013) More effective distributed ML via a stale synchronous parallel parameter server. In: Advances in neural information processing systems, vol 26

  15. Kavarakuntla T, Han L, Lloyd H, Latham A, Akintoye SB (2021) Performance analysis of distributed deep learning frameworks in a multi-GPU environment. In: 20th International Conference on Ubiquitous Computing and Communications (IUCC/CIT/DSCI/SmartCNS), pp 406–413. https://doi.org/10.1109/IUCC-CIT-DSCI-SmartCNS55181.2021.00071

  16. Keuper J, Pfreundt F-J (2015) Asynchronous parallel stochastic gradient descent: a numeric core for scalable distributed machine learning algorithms. In: Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments. MLHPC ’15. Association for Computing Machinery, New York. https://doi.org/10.1145/2834892.2834893

  17. Harlap A, Cui H, Dai W, Wei J, Ganger GR, Gibbons PB, Gibson GA, Xing EP (2016) Addressing the straggler problem for iterative convergent parallel ML. In: Proceedings of the Seventh ACM Symposium on Cloud Computing. Association for Computing Machinery, New York, pp 98–111. https://doi.org/10.1145/2987550.2987554

  18. Moreno-Alvarez S, Haut JM, Paoletti ME, Rico-Gallego JA, Diaz-Martin JC, Plaza J (2020) Training deep neural networks: a static load balancing approach. J Supercomput 76:9739–9754

  19. Yang E, Kang D-K, Youn C-H (2020) BOA: batch orchestration algorithm for straggler mitigation of distributed DL training in heterogeneous GPU cluster. J Supercomput 76:47–67

  20. Goyal P, Dollár P, Girshick R, Noordhuis P, Wesolowski L, Kyrola A, Tulloch A, Jia Y, He K (2018) Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv:1706.02677 [cs.CV]

  21. Tao Z, Li Q (2018) eSGD: communication efficient distributed deep learning on the edge. In: USENIX Workshop on Hot Topics in Edge Computing (HotEdge 18). USENIX Association, Boston

  22. Ye Q, Zhou Y, Shi M, Sun Y, Lv J (2022) DLB: a dynamic load balance strategy for distributed training of deep neural networks. IEEE Trans Emerg Top Comput Intell. https://doi.org/10.1109/TETCI.2022.3220224

  23. Krizhevsky A, Hinton G et al (2009) Learning multiple layers of features from tiny images

  24. Li S, Zhao Y, Varma R, Salpekar O, Noordhuis P, Li T, Paszke A, Smith J, Vaughan B, Damania P et al (2020) PyTorch distributed: experiences on accelerating data parallel training. arXiv:2006.15704 [cs.CV]

  25. You Y, Gitman I, Ginsburg B (2017) Scaling SGD batch size to 32K for ImageNet training. arXiv:1708.03888 [cs.CV]

  26. Li S, Walls RJ, Xu L, Guo T (2019) Speeding up deep learning with transient servers. In: IEEE International Conference on Autonomic Computing (ICAC), pp 125–135. https://doi.org/10.1109/ICAC.2019.00024

  27. Li S, Walls RJ, Guo T (2020) Characterizing and modeling distributed training with transient cloud GPU servers. In: IEEE 40th International Conference on Distributed Computing Systems (ICDCS), pp 943–953. https://doi.org/10.1109/ICDCS47774.2020.00097

  28. Zheng S, Meng Q, Wang T, Chen W, Yu N, Ma Z-M, Liu T-Y (2017) Asynchronous stochastic gradient descent with delay compensation. In: International Conference on Machine Learning, vol 70, pp 4120–4129. PMLR. https://proceedings.mlr.press/v70/zheng17b.html

  29. Ko Y, Kim S-W (2022) SHAT: a novel asynchronous training algorithm that provides fast model convergence in distributed deep learning. Appl Sci. https://doi.org/10.3390/app12010292

  30. Zhang W, Gupta S, Lian X, Liu J (2016) Staleness-aware async-SGD for distributed deep learning. In: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp 2350–2356

  31. Li S, Mangoubi O, Xu L, Guo T (2021) Sync-switch: hybrid parameter synchronization for distributed deep learning. In: IEEE 41st International Conference on Distributed Computing Systems (ICDCS), pp 528–538. https://doi.org/10.1109/ICDCS51616.2021.00057

  32. Zhao X, Papagelis M, An A, Chen BX, Liu J, Hu Y (2019) Elastic bulk synchronous parallel model for distributed deep learning. In: IEEE International Conference on Data Mining (ICDM), pp 1504–1509. https://doi.org/10.1109/ICDM.2019.00198

  33. Li S, Ben-Nun T, Girolamo SD, Alistarh D, Hoefler T (2020) Taming unbalanced training workloads in deep learning with partial collective operations. Association for Computing Machinery, New York, pp 45–61. https://doi.org/10.1145/3332466.3374528

  34. Chen C, Weng Q, Wang W, Li B, Li B (2020) Semi-dynamic load balancing: efficient distributed learning in non-dedicated environments. Association for Computing Machinery, New York, pp 431–446. https://doi.org/10.1145/3419111.3421299

  35. Chetlur S, Woolley C, Vandermersch P, Cohen J, Tran J, Catanzaro B, Shelhamer E (2014) cuDNN: efficient primitives for deep learning. arXiv:1410.0759 [cs.CV]

  36. Stich SU (2018) Local SGD converges fast and communicates little. arXiv:1805.09767 [cs.CV]

  37. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778

  38. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S et al (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv:2010.11929 [cs.CV]

  39. Ma N, Zhang X, Zheng H-T, Sun J (2018) ShuffleNet V2: practical guidelines for efficient CNN architecture design. In: Proceedings of the European Conference on Computer Vision (ECCV)

  40. Tan M, Le Q (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp 6105–6114

Acknowledgments

The authors would like to acknowledge the support of the National Natural Science Foundation of China under grant No. 62376226, the Shaanxi Key Research and Development Program under grant No. 2023-ZDLNY-63, the Xianyang Key Research and Development Program under grant No. L2022-ZDYF-NY-019, and the Key Research and Development Program of Shaanxi under grants No. 2019ZDLNY07-06-01 and No. 2020NY-098.

Funding

This work was supported by the National Natural Science Foundation of China under grant No. 62376226, the Shaanxi Key Research and Development Program under grant No. 2023-ZDLNY-63, the Xianyang Key Research and Development Program under grant No. L2022-ZDYF-NY-019, and the Key Research and Development Program of Shaanxi under grants No. 2019ZDLNY07-06-01 and No. 2020NY-098.

Author information

Contributions

FY proposed the idea, participated in the protocol design, wrote the main sections of the paper, and produced the figures and tables. BL supervised the research and proofread the manuscript. HG participated in constructing the experimental workflow.

Corresponding authors

Correspondence to Zeyu Ji or Bin Liu.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests as defined by Springer or other interests that might be perceived to influence the results and/or discussion reported in this paper.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yao, F., Zhang, Z., Ji, Z. et al. LBB: load-balanced batching for efficient distributed learning on heterogeneous GPU cluster. J Supercomput (2024). https://doi.org/10.1007/s11227-023-05886-w

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s11227-023-05886-w
