DRL-based and Bsld-Aware Job Scheduling for Apache Spark Cluster in Hybrid Cloud Computing Environments

Shi, Wenhu; Li, Hongjian; Zeng, Hang

doi:10.1007/s10723-022-09630-1

Published: 09 December 2022

Journal of Grid Computing, Volume 20, article number 44 (2022)

Abstract

Spark is one of the most important big data computing engines, favored by academia and industry for its low latency and ease of use. The explosive growth in data volumes is making computing tasks that could otherwise run on local or on-premise resources infeasible. Public clouds address this shortage of local or on-premise resources. However, deploying clusters only on public clouds is costly and leaves available local resources unused. Deploying a Spark cluster across both local and public cloud resources is therefore a good way to save cost without wasting local resources. When Spark is deployed in hybrid cloud environments, its default scheduling policy ignores job and environment characteristics, leading to performance degradation and increased cluster usage costs. In this paper, a deep reinforcement learning-based (DRL-based) Spark job scheduler is proposed to improve cluster performance and reduce the total cost of cluster usage in hybrid cloud environments. Specifically, the proposed DRL agent adaptively learns the characteristics of different types of jobs and of hybrid cloud environments, and schedules Spark jobs so as to reduce both the total cluster usage cost and the average bounded slowdown (Bsld) of jobs. A simulation environment is built to train the proposed scheduling agent, and the Spark Core module is extended to verify its effectiveness. Experimental results show that the DRL-based algorithm improves performance by 5.55% and reduces the total cluster usage cost by 13.9% on average compared to the baseline algorithm in burst arrival mode.
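
The abstract names the average bounded slowdown as one of the two quantities the scheduler tries to reduce, but this page does not reproduce the paper's formula. As a rough illustration only, the sketch below uses the definition common in the batch-scheduling literature, max((wait + run) / max(run, threshold), 1). The function names, the 10-second threshold, and the sample job values are assumptions made for this example and are not taken from the paper.

```python
def bounded_slowdown(wait_time, run_time, threshold=10.0):
    """Bounded slowdown of one job (common literature definition, assumed here).

    wait_time and run_time are in seconds. The threshold (an assumed value)
    keeps very short jobs from producing huge slowdown ratios.
    """
    if wait_time < 0 or run_time < 0:
        raise ValueError("wait_time and run_time must be non-negative")
    return max((wait_time + run_time) / max(run_time, threshold), 1.0)


def average_bounded_slowdown(jobs, threshold=10.0):
    """Mean bounded slowdown over an iterable of (wait_time, run_time) pairs."""
    jobs = list(jobs)
    if not jobs:
        raise ValueError("at least one job is required")
    total = sum(bounded_slowdown(w, r, threshold) for w, r in jobs)
    return total / len(jobs)


# Hypothetical jobs: (seconds waited in the queue, seconds of execution).
sample_jobs = [(30.0, 10.0), (0.0, 1.0)]
avg_bsld = average_bounded_slowdown(sample_jobs)
```

Under this definition, a job that waits 30 s and runs 10 s has a bounded slowdown of 4, while a 1 s job that starts immediately is clamped to 1, so short jobs do not dominate the average.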

Data Availability

The datasets generated during the current study are available from the corresponding author on reasonable request.

Author information

Authors and Affiliations

College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Nanshan Street, Chongqing, 400000, China
Wenhu Shi, Hongjian Li & Hang Zeng

Contributions

Wenhu Shi: Proposed an idea, Experiment, Wrote the manuscript. Hongjian Li: Proposed an idea, Experiment, Wrote the manuscript. Hang Zeng: Experiment, Helped to write several sections of the manuscript, Proofreading.

Corresponding author

Correspondence to Hongjian Li.

Ethics declarations

Competing interests

None. The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Shi, W., Li, H. & Zeng, H. DRL-based and Bsld-Aware Job Scheduling for Apache Spark Cluster in Hybrid Cloud Computing Environments. J Grid Computing 20, 44 (2022). https://doi.org/10.1007/s10723-022-09630-1

Received: 11 August 2022
Accepted: 29 October 2022
Published: 09 December 2022
DOI: https://doi.org/10.1007/s10723-022-09630-1
