Skip to main content
Log in

DRL-based and Bsld-Aware Job Scheduling for Apache Spark Cluster in Hybrid Cloud Computing Environments

  • Published:
Journal of Grid Computing Aims and scope Submit manuscript

Abstract

Spark is one of the most important big data computing engines, favored by academia and industry for its low latency and ease of use. The explosive growth in data volumes is causing computing tasks that could otherwise run on local or on-premise resources to become infeasible. The emergence of public clouds has solved the problem of shortage of local or on-premise resources. However, deploying clusters only on public clouds can be costly on the one hand and wasteful of available local resources on the other. Therefore, deploying a Spark cluster on both local and public cloud resources becomes a good solution to save cost and not waste local resources. When Spark is deployed in hybrid cloud environments, its default scheduling policy ignores job and environment characteristics leading to performance degradation and increased cluster usage costs. In this paper, A deep reinforcement learning-based (DRL-based) Spark job scheduler is proposed to improve cluster performance and reduce the total cost of cluster usage in hybrid cloud environments. Specifically, the proposed DRL agent can adaptively learn the characteristics of different types of jobs and hybrid cloud environments to rationally schedule Spark jobs to reduce the total cluster usage cost and the average bounded slowdown of jobs. A simulation environment is built to train the proposed scheduling agent, and the Spark Core module is extended to verify the effectiveness of the proposed scheduling agent. Experimental results show that the DRL-based algorithm improves performance by 5.55% and reduces the total cluster usage cost by 13.9% on average compared to the baseline algorithm in burst arrival mode.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

Data Availability

The datasets generated during the current study are available from the corresponding author on reasonable request.

References

  1. Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The hadoop distributed file system. In: 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pp 1–10. IEEE (2010)

  2. Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., Stoica, I.: Spark: Cluster computing with working sets. In: 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 10) (2010)

  3. Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi, S., Tzoumas, K.: Apache flink: Stream and batch processing in a single engine. Bull. IEEE Comput. Soc. Tech. Comm. Data Eng. 36(4) (2015)

  4. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauly, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: A {Fault-Tolerant} abstraction for {In-Memory} cluster computing. In: 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pp 15–28 (2012)

  5. Rasmussen, R.V., Trick, M. A.: Round Robin scheduling–a survey. Eur. J. Oper. Res. 188(3), 617–636 (2008)

    Article  MathSciNet  MATH  Google Scholar 

  6. Li, C., Cai, Q., Luo, Y.: Dynamic data replacement and adaptive scheduling policies in spark. Clust. Comput., 1–19 (2022)

  7. Li, H., Wei, Y., Xiong, Y., Ma, E., Tian, W.: A frequency-aware and energy-saving strategy based on DVFS for spark. J. Supercomput. 77(10), 11575–11596 (2021)

    Article  Google Scholar 

  8. Wang, K., Khan, M.M.H., Nguyen, N., Gokhale, S.: Design and implementation of an analytical framework for interference aware job scheduling on apache spark platform. Clust. Comput. 22(1), 2223–2237 (2019)

    Article  Google Scholar 

  9. Tang, Z., Zeng, A., Zhang, X., Yang, L., Li, K.: Dynamic memory-aware scheduling in spark computing environment. J. Parall. Distrib. Comput. 141, 10–22 (2020)

    Article  Google Scholar 

  10. Li, H., Wang, H., Fang, S., Zou, Y., Tian, W.: An energy-aware scheduling algorithm for big data applications in spark. Clust. Comput. 23(2), 593–609 (2020)

    Article  Google Scholar 

  11. Fu, Z., Tang, Z., Yang, L., Liu, C.: An optimal locality-aware task scheduling algorithm based on bipartite graph modelling for spark applications. IEEE Trans. Parall. Distrib. Syst. 31 (10), 2406–2420 (2020)

    Article  Google Scholar 

  12. Li, D., Hu, Z., Lai, Z., Zhang, Y., Lu, K.: Coordinative scheduling of computation and communication in data-parallel systems. IEEE Trans. Comput. 70(12), 2182–2197 (2020)

    MATH  Google Scholar 

  13. Islam, M.T., Srirama, S.N., Karunasekera, S., Buyya, R.: Cost-efficient dynamic scheduling of big data applications in apache spark on cloud. J. Syst. Softw. 162, 110515 (2020)

    Article  Google Scholar 

  14. Roveda, L., Maskani, J., Franceschi, P., Abdi, A., Braghin, F., Molinari Tosatti, L., Pedrocchi, N.: Model-based reinforcement learning variable impedance control for human-robot collaboration. J. Intell. Robot. Syst. 100(2), 417–433 (2020)

    Article  Google Scholar 

  15. Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)

    Article  Google Scholar 

  16. Li, J., Monroe, W., Ritter, A., Galley, M., Gao, J., Jurafsky, D.: Deep reinforcement learning for dialogue generation. arXiv:1606.01541 (2016)

  17. Li, J., Monroe, W., Ritter, A., Galley, M., Gao, J., Jurafsky, D.: Deep reinforcement learning for dialogue generation. arXiv:1606.01541 (2016)

  18. Berner, C., Brockman, G., Chan, B., Cheung, V., Dȩbiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., et al.: Dota 2 with large scale deep reinforcement learning. arXiv:1912.06680 (2019)

  19. Duan, J., Shi, D., Diao, R., Li, H., Wang, Z., Zhang, B., Bian, D., Yi, Z.: Deep-reinforcement-learning-based autonomous voltage control for power grid operations. IEEE Trans. Power Syst. 35(1), 814–817 (2019)

    Article  Google Scholar 

  20. Zrigui, S., de Camargo, R.Y., Legrand, A., Trystram, D.: Improving the performance of batch schedulers using online job runtime classification. J. Parall. Distrib. Comput. 164, 83–95 (2022)

    Article  Google Scholar 

  21. Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: Openai gym. arXiv:1606.01540 (2016)

  22. Shetti, M.M., Li, B., Du, D.H.: E-VM: An elastic virtual machine scheduling algorithm to minimize the total cost of ownership in a hybrid cloud. In: 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), pp 202–211. IEEE (2021)

  23. Qiu, Z., Chen, L., Li, X.: Hybrid cloud resource scheduling with multi-dimensional configuration requirements. In: 2021 IEEE World Congress on Services (SERVICES), pp 133–138. IEEE (2021)

  24. Wang, B., Wang, C., Huang, W., Song, Y., Qin, X.: Security-aware task scheduling with deadline constraints on heterogeneous hybrid clouds. J. Parall. Distrib. Comput. 153, 15–28 (2021)

    Article  Google Scholar 

  25. Yeh, T., Chen, Y.: Improving the hybrid cloud performance through disk activity-aware data access. Simul. Model. Pract. Theory 109, 102296 (2021)

    Article  Google Scholar 

  26. Li, C., Cai, Q., Luo, Y.: Dynamic data replacement and adaptive scheduling policies in spark. Clust. Comput. 25(2), 1421–1439 (2022)

    Article  Google Scholar 

  27. Liu, L., Xu, H.: Elasecutor: Elastic executor scheduling in data analytics systems. IEEE/ACM Trans. Networking 29(2), 681–694 (2021)

    Article  MathSciNet  Google Scholar 

  28. Islam, M.T., Wu, H., Karunasekera, S., Buyya, R.: Sla-based scheduling of spark jobs in hybrid cloud computing environments. IEEE Trans. Comput. 71(5), 1117–1132 (2021)

    Article  MATH  Google Scholar 

  29. Zade, B.M.H., Mansouri, N.: Improved red fox optimizer with fuzzy theory and game theory for task scheduling in cloud environment. J. Comput. Sci 63, 101805 (2022)

    Article  Google Scholar 

  30. Zhang, Z., Zhao, M., Wang, H., Cui, Z., Zhang, W.: An efficient interval many-objective evolutionary algorithm for cloud task scheduling problem under uncertainty. Inform. Sci. 583, 56–72 (2022)

    Article  Google Scholar 

  31. Islam, M.T., Karunasekera, S., Buyya, R.: Performance and cost-efficient spark job scheduling based on deep reinforcement learning in cloud computing environments. IEEE Trans. Parall. Distrib. Syst. 33(7), 1695–1710 (2021)

    Article  Google Scholar 

  32. Guo, W., Tian, W., Ye, Y., Xu, L., Wu, K.: Cloud resource scheduling with deep reinforcement learning and imitation learning. IEEE Internet of Things J. 8(5), 3576–3586 (2020)

    Article  Google Scholar 

  33. Mao, H., Schwarzkopf, M., Venkatakrishnan, S.B., Meng, Z., Alizadeh, M.: Learning scheduling algorithms for data processing clusters. In: Proceedings of the ACM Special Interest Group on Data Communication, pp 270–288 (2019)

  34. Ran, L., Shi, X., Shang, M.: Slas-aware online task scheduling based on deep reinforcement learning method in cloud environment. In: 2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pp 1518–1525. IEEE (2019)

  35. Li, T., Xu, Z., Tang, J., Wang, Y.: Model-free control for distributed stream data processing using deep reinforcement learning. arXiv:1803.01016 (2018)

  36. Zade, B.M.H., Mansouri, N., Javidi, M. M.: A two-stage scheduler based on new caledonian crow learning algorithm and reinforcement learning strategy for cloud environment. J. Netw. Comput. Appl. 202, 103385 (2022)

    Article  Google Scholar 

  37. Wang, X., Zhang, L., Liu, Y., Li, F., Chen, Z., Zhao, C., Bai, T.: Dynamic scheduling of tasks in cloud manufacturing with multi-agent reinforcement learning. J. Manuf. Syst. 65, 130–145 (2022)

    Article  Google Scholar 

  38. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv:1707.06347 (2017)

Download references

Author information

Authors and Affiliations

Authors

Contributions

Wenhu Shi: Proposed an idea, Experiment, Wrote the manuscript. Hongjian Li: Proposed an idea, Experiment, Wrote the manuscript. Hang Zeng: Experiment, Helped to wrote also several sections of the manuscript, Proofreading.

Corresponding author

Correspondence to Hongjian Li.

Ethics declarations

Competing interests

None. The authors declare that they have no known conflict financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Shi, W., Li, H. & Zeng, H. DRL-based and Bsld-Aware Job Scheduling for Apache Spark Cluster in Hybrid Cloud Computing Environments. J Grid Computing 20, 44 (2022). https://doi.org/10.1007/s10723-022-09630-1

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s10723-022-09630-1

Keywords

Navigation