
gCFS: completely fair scheduling on multiple GPUs for improved multi-DNN execution in terms of performance isolation

Published in: The Journal of Supercomputing

Abstract

Priority-based performance isolation among DNN tasks in a GPU cluster should be enforced from the perspective of the GPUs executing the DNNs, rather than from that of the CPU-side tasks that supervise them. In this paper, we propose gCFS, which gives each DNN GPU occupancy in proportion to its priority. gCFS inherits the CPU-side fair-share scheduling policy and thereby achieves GPU-perspective performance isolation proportional to priorities. Its smaller scheduling granularity enables more precise control over GPU time slices and queues DNN workloads more densely, reducing GPU idle time. During scheduling, gCFS elastically adjusts the length of each DNN workload to the given time slice and dynamically selects the optimal GPU. Experiments with multiple concurrently running DNNs show that priority-based performance isolation is significantly improved over execution without gCFS, while the makespan and the DNN completion time are reduced by up to 40.4% and 41.8%, respectively.
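To illustrate the fair-share policy the abstract refers to, the following is a minimal, hypothetical Python sketch of CFS-style weighted virtual-runtime scheduling: each task's virtual runtime advances inversely to its weight, and the scheduler always runs the task that is furthest behind in virtual time, so GPU occupancy converges to the priority ratio. All names and parameters here are illustrative assumptions; this is not the authors' gCFS implementation, which also resizes workloads to the time slice and selects among multiple GPUs.

```python
import heapq

class Task:
    """A DNN workload entry with a CFS-style weight derived from its priority."""
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight    # higher weight = higher priority = larger share
        self.vruntime = 0.0     # virtual runtime; advances more slowly for heavy weights
        self.gpu_time = 0.0     # actual GPU time received so far

    def __lt__(self, other):
        return self.vruntime < other.vruntime

def run_fair(tasks, time_slice, total_time):
    """Repeatedly run the task with the smallest virtual runtime for one slice."""
    heap = list(tasks)
    heapq.heapify(heap)
    elapsed = 0.0
    while elapsed < total_time:
        task = heapq.heappop(heap)                  # most "behind" task in virtual time
        task.gpu_time += time_slice                 # charge one GPU time slice
        task.vruntime += time_slice / task.weight   # weighted virtual-time advance
        elapsed += time_slice
        heapq.heappush(heap, task)
    return {t.name: t.gpu_time for t in tasks}

# Two DNNs with a 2:1 priority ratio receive roughly 2:1 GPU occupancy.
shares = run_fair([Task("dnn_a", 2.0), Task("dnn_b", 1.0)],
                  time_slice=1.0, total_time=300.0)
```

A smaller `time_slice` corresponds to the finer scheduling granularity the paper exploits: occupancy tracks the priority ratio more tightly, at the cost of more scheduling decisions.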


Data availability

Data sharing is not applicable to this article, as no datasets were generated or analyzed during the current study.


Acknowledgements

This research was financially supported by Hansung University.

Author information


Corresponding author

Correspondence to Myungsun Kim.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Cho, H., Kim, M. gCFS: completely fair scheduling on multiple GPUs for improved multi-DNN execution in terms of performance isolation. J Supercomput 79, 5851–5877 (2023). https://doi.org/10.1007/s11227-022-04901-w

