
LBB: load-balanced batching for efficient distributed learning on heterogeneous GPU cluster

  • Published:

The Journal of Supercomputing

Abstract

As the cost of deep learning training increases, using heterogeneous GPU clusters is a reasonable way to scale cluster resources to support distributed deep learning (DDL) tasks. However, the commonly used synchronous stochastic gradient descent (SSGD) algorithm, based on the bulk synchronous parallel (BSP) model, suffers from stragglers in heterogeneous clusters, which significantly reduces training efficiency. To overcome this challenge, we propose load-balanced batching (LBB) to eliminate stragglers in DDL workloads. LBB first formulates the load-balancing problem and builds a performance model for each worker in the DDL workload by analyzing the relationship between DDL iteration time and the worker's local batch size. LBB then balances the workload across workers by coordinating their local batch sizes. In particular, LBB greatly mitigates static stragglers and severe dynamic stragglers by solving the load-balancing problem, and it eliminates residual stragglers by fine-tuning batch sizes during training. LBB is implemented in PyTorch, and extensive experiments are performed on a heterogeneous server equipped with four GPUs of three different models. The experimental results verify the effectiveness of LBB on standard benchmarks, demonstrating that LBB reduces training time by 64.57%, 59%, and 5.4% compared with SSGD, local SGD, and FlexRR, respectively, without sacrificing accuracy.
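
To make the batching idea concrete, the sketch below is a minimal illustration (not the authors' released LBB code) of how local batch sizes could be coordinated. It assumes a linear per-worker performance model, iteration_time ≈ a_i · batch_size_i + c_i, fits it from a few profiled timings, and solves for local batch sizes that equalize the predicted iteration time while keeping the global batch size fixed; the dynamic fine-tuning of batch sizes during training described in the abstract is omitted, and all function and variable names are hypothetical. The sketch uses NumPy only.

```python
import numpy as np

def fit_worker_models(profiled):
    """Fit a linear per-worker model t = a*b + c from profiled
    (batch_size, iteration_time) samples.

    `profiled` maps a worker id to a list of (batch_size, seconds) pairs."""
    models = {}
    for wid, samples in profiled.items():
        bs = np.array([s[0] for s in samples], dtype=float)
        ts = np.array([s[1] for s in samples], dtype=float)
        a, c = np.polyfit(bs, ts, 1)  # least-squares slope and intercept
        models[wid] = (a, c)
    return models

def balance_batch_sizes(models, global_batch):
    """Pick local batch sizes b_i so every worker's predicted iteration time
    a_i*b_i + c_i is (approximately) equal and sum(b_i) == global_batch.

    Equalizing the time to a common T gives b_i = (T - c_i)/a_i; summing over
    workers and solving for T yields the closed form used below."""
    workers = sorted(models)
    a = np.array([models[w][0] for w in workers])
    c = np.array([models[w][1] for w in workers])
    T = (global_batch + np.sum(c / a)) / np.sum(1.0 / a)
    sizes = np.maximum(1, np.floor((T - c) / a)).astype(int)
    # Hand out the rounding remainder one sample at a time, fastest workers first.
    order = np.argsort(a)
    i = 0
    while sizes.sum() < global_batch:
        sizes[order[i % len(order)]] += 1
        i += 1
    return {w: int(s) for w, s in zip(workers, sizes)}

# Example: two similar fast GPUs and one slower GPU sharing a global batch of 256.
profiled = {
    0: [(32, 0.040), (64, 0.072), (128, 0.136)],
    1: [(32, 0.041), (64, 0.074), (128, 0.140)],
    2: [(32, 0.080), (64, 0.150), (128, 0.290)],
}
models = fit_worker_models(profiled)
print(balance_batch_sizes(models, global_batch=256))
```

In a running system, the fitted coefficients would be refreshed from measured iteration times so that the assignment can also track dynamic stragglers, in the spirit of the fine-tuning step described above.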

Data availability

The public CIFAR10 and CIFAR100 datasets [23] used in this research are available at https://www.cs.toronto.edu/~kriz/cifar.html.

Code availability

The authors will release the LBB implementation for reproducibility once it has been organized. The code will be made available at https://github.com/FLYING37520/LBB.

References

  1. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, vol 30

  2. Jiang P, Ergu D, Liu F, Cai Y, Ma B (2022) A review of Yolo algorithm developments. Proc Comput Sci 199:1066–1073. https://doi.org/10.1016/j.procs.2022.01.135

  3. Saharia C, Chan W, Saxena S, Li L, Whang J, Denton E, Ghasemipour SKS, Ayan BK, Mahdavi SS, Lopes RG, Salimans T, Ho J, Fleet DJ, Norouzi M (2022) Photorealistic text-to-image diffusion models with deep language understanding. arXiv:2205.11487 [cs.CV]

  4. Ramesh A, Pavlov M, Goh G, Gray S, Voss C, Radford A, Chen M, Sutskever I (2021) Zero-shot text-to-image generation. In: Proceedings of the 38th International Conference on Machine Learning, vol 139, pp 8821–8831

  5. Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G, Sutskever I (2021) Learning transferable visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning, vol 139, pp 8748–8763. https://proceedings.mlr.press/v139/radford21a.html

  6. Krizhevsky A, Sutskever I, Hinton GE (2017) ImageNet classification with deep convolutional neural networks. Commun ACM 60(6):84–90. https://doi.org/10.1145/3065386

  7. Shoeybi M, Patwary M, Puri R, LeGresley P, Casper J, Catanzaro B (2020) Megatron-LM: training multi-billion parameter language models using model parallelism. arXiv:1909.08053 [cs.CV]

  8. Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S, Herbert-Voss A, Krueger G, Henighan T, Child R, Ramesh A, Ziegler D, Wu J, Winter C, Hesse C, Chen M, Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford A, Sutskever I, Amodei D (2020) Language models are few-shot learners. In: Advances in neural information processing systems, vol 33, pp 1877–1901

  9. Tang Z, Shi S, Chu X, Wang W, Li B (2020) Communication-efficient distributed deep learning: a comprehensive survey. arXiv:2003.06307 [cs.CV]

  10. Gan S, Jiang J, Yuan B, Zhang C, Lian X, Wang R, Chang J, Liu C, Shi H, Zhang S, Li X, Sun T, Yang S, Liu J (2021) Bagua: scaling up distributed learning with system relaxations. Proc VLDB Endow 15(4):804–813. https://doi.org/10.14778/3503585.3503590

  11. Jiang J, Cui B, Zhang C, Yu L (2017) Heterogeneity-aware distributed parameter servers. Association for Computing Machinery, New York, pp 463–478. https://doi.org/10.1145/3035918.3035933

  12. Narayanan D, Santhanam K, Kazhamiaka F, Phanishayee A, Zaharia M (2020) Heterogeneity-aware cluster scheduling policies for deep learning workloads. In: 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pp 481–498. https://www.usenix.org/conference/osdi20/presentation/narayanan-deepak

  13. Kim H, Song C, Lee H, Yu H (2023) Addressing straggler problem through dynamic partial all-reduce for distributed deep learning in heterogeneous GPU clusters. In: IEEE International Conference on Consumer Electronics (ICCE), pp 1–6. https://doi.org/10.1109/ICCE56470.2023.10043527

  14. Ho Q, Cipar J, Cui H, Lee S, Kim JK, Gibbons PB, Gibson GA, Ganger G, Xing EP (2013) More effective distributed ML via a stale synchronous parallel parameter server. In: Advances in neural information processing systems, vol 26

  15. Kavarakuntla T, Han L, Lloyd H, Latham A, Akintoye SB (2021) Performance analysis of distributed deep learning frameworks in a multi-GPU environment. In: 20th International Conference on Ubiquitous Computing and Communications (IUCC/CIT/DSCI/SmartCNS), pp 406–413. https://doi.org/10.1109/IUCC-CIT-DSCI-SmartCNS55181.2021.00071

  16. Keuper J, Pfreundt F-J (2015) Asynchronous parallel stochastic gradient descent: a numeric core for scalable distributed machine learning algorithms. In: Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments. MLHPC ’15. Association for Computing Machinery, New York. https://doi.org/10.1145/2834892.2834893

  17. Harlap A, Cui H, Dai W, Wei J, Ganger GR, Gibbons PB, Gibson GA, Xing EP (2016) Addressing the straggler problem for iterative convergent parallel ML. In: Proceedings of the Seventh ACM Symposium on Cloud Computing. Association for Computing Machinery, New York, pp 98–111. https://doi.org/10.1145/2987550.2987554

  18. Moreno-Alvarez S, Haut JM, Paoletti ME, Rico-Gallego JA, Diaz-Martin JC, Plaza J (2020) Training deep neural networks: a static load balancing approach. J Supercomput 76:9739–9754

  19. Yang E, Kang D-K, Youn C-H (2020) BOA: batch orchestration algorithm for straggler mitigation of distributed DL training in heterogeneous GPU cluster. J Supercomput 76:47–67

  20. Goyal P, Dollár P, Girshick R, Noordhuis P, Wesolowski L, Kyrola A, Tulloch A, Jia Y, He K (2018) Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv:1706.02677 [cs.CV]

  21. Tao Z, Li Q (2018) eSGD: communication efficient distributed deep learning on the edge. In: USENIX Workshop on Hot Topics in Edge Computing (HotEdge 18). USENIX Association, Boston

  22. Ye Q, Zhou Y, Shi M, Sun Y, Lv J (2022) DLB: a dynamic load balance strategy for distributed training of deep neural networks. IEEE Trans Emerg Top Comput Intell. https://doi.org/10.1109/TETCI.2022.3220224

  23. Krizhevsky A, Hinton G et al (2009) Learning multiple layers of features from tiny images

  24. Li S, Zhao Y, Varma R, Salpekar O, Noordhuis P, Li T, Paszke A, Smith J, Vaughan B, Damania P et al (2020) PyTorch distributed: experiences on accelerating data parallel training. arXiv:2006.15704 [cs.CV]

  25. You Y, Gitman I, Ginsburg B (2017) Scaling SGD batch size to 32K for ImageNet training. arXiv:1708.03888 [cs.CV]

  26. Li S, Walls RJ, Xu L, Guo T (2019) Speeding up deep learning with transient servers. In: IEEE International Conference on Autonomic Computing (ICAC), pp 125–135. https://doi.org/10.1109/ICAC.2019.00024

  27. Li S, Walls RJ, Guo T (2020) Characterizing and modeling distributed training with transient cloud GPU servers. In: IEEE 40th International Conference on Distributed Computing Systems (ICDCS), pp 943–953. https://doi.org/10.1109/ICDCS47774.2020.00097

  28. Zheng S, Meng Q, Wang T, Chen W, Yu N, Ma Z-M, Liu T-Y (2017) Asynchronous stochastic gradient descent with delay compensation. In: International Conference on Machine Learning, vol 70, pp 4120–4129. PMLR. https://proceedings.mlr.press/v70/zheng17b.html

  29. Ko Y, Kim S-W (2022) SHAT: a novel asynchronous training algorithm that provides fast model convergence in distributed deep learning. Appl Sci. https://doi.org/10.3390/app12010292

  30. Zhang W, Gupta S, Lian X, Liu J (2016) Staleness-aware async-SGD for distributed deep learning. In: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp 2350–2356

  31. Li S, Mangoubi O, Xu L, Guo T (2021) Sync-switch: hybrid parameter synchronization for distributed deep learning. In: IEEE 41st International Conference on Distributed Computing Systems (ICDCS), pp 528–538. https://doi.org/10.1109/ICDCS51616.2021.00057

  32. Zhao X, Papagelis M, An A, Chen BX, Liu J, Hu Y (2019) Elastic bulk synchronous parallel model for distributed deep learning. In: IEEE International Conference on Data Mining (ICDM), pp 1504–1509. https://doi.org/10.1109/ICDM.2019.00198

  33. Li S, Ben-Nun T, Girolamo SD, Alistarh D, Hoefler T (2020) Taming unbalanced training workloads in deep learning with partial collective operations. Association for Computing Machinery, New York, pp 45–61. https://doi.org/10.1145/3332466.3374528

  34. Chen C, Weng Q, Wang W, Li B, Li B (2020) Semi-dynamic load balancing: efficient distributed learning in non-dedicated environments. Association for Computing Machinery, New York, pp 431–446. https://doi.org/10.1145/3419111.3421299

  35. Chetlur S, Woolley C, Vandermersch P, Cohen J, Tran J, Catanzaro B, Shelhamer E (2014) cuDNN: efficient primitives for deep learning. arXiv:1410.0759 [cs.CV]

  36. Stich SU (2018) Local SGD converges fast and communicates little. arXiv:1805.09767 [cs.CV]

  37. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778

  38. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S et al (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv:2010.11929 [cs.CV]

  39. Ma N, Zhang X, Zheng H-T, Sun J (2018) ShuffleNet V2: practical guidelines for efficient CNN architecture design. In: Proceedings of the European Conference on Computer Vision (ECCV)

  40. Tan M, Le Q (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp 6105–6114

Acknowledgments

The authors would like to acknowledge the support of the National Natural Science Foundation of China under grant No. 62376226, the Shaanxi Key Research and Development Program under grant No. 2023-ZDLNY-63, the Xianyang Key Research and Development Program under grant No. L2022-ZDYF-NY-019, and the Key Research and Development Program of Shaanxi under grants No. 2019ZDLNY07-06-01 and No. 2020NY-098.

Funding

This work was supported by the National Natural Science Foundation of China under grant No. 62376226, the Shaanxi Key Research and Development Program under grant No. 2023-ZDLNY-63, the Xianyang Key Research and Development Program under grant No. L2022-ZDYF-NY-019, and the Key Research and Development Program of Shaanxi under grants No. 2019ZDLNY07-06-01 and No. 2020NY-098.

Author information

Contributions

FY proposed the idea, participated in the protocol design, wrote the main sections of the paper, and produced the figures and tables. BL supervised the research and proofread the manuscript. HG participated in constructing the experimental workflow.

Corresponding authors

Correspondence to Zeyu Ji or Bin Liu.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests as defined by Springer or other interests that might be perceived to influence the results and/or discussion reported in this paper.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yao, F., Zhang, Z., Ji, Z. et al. LBB: load-balanced batching for efficient distributed learning on heterogeneous GPU cluster. J Supercomput (2024). https://doi.org/10.1007/s11227-023-05886-w

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s11227-023-05886-w
