Abstract
Spark is one of the most important big data computing engines, favored by academia and industry for its low latency and ease of use. As data volumes grow explosively, computing tasks that once ran on local or on-premise resources are becoming infeasible there. Public clouds relieve this shortage of local resources, but deploying clusters only on public clouds is costly and leaves available local resources idle. Deploying a Spark cluster across both local and public cloud resources is therefore a good way to save cost without wasting local resources. When Spark is deployed in hybrid cloud environments, however, its default scheduling policy ignores job and environment characteristics, leading to performance degradation and increased cluster usage costs. In this paper, a deep reinforcement learning-based (DRL-based) Spark job scheduler is proposed to improve cluster performance and reduce the total cost of cluster usage in hybrid cloud environments. Specifically, the proposed DRL agent adaptively learns the characteristics of different types of jobs and of the hybrid cloud environment, and schedules Spark jobs so as to reduce both the total cluster usage cost and the average bounded slowdown of jobs. A simulation environment is built to train the proposed scheduling agent, and the Spark Core module is extended to verify its effectiveness. Experimental results show that, in burst arrival mode, the DRL-based algorithm improves performance by 5.55% and reduces the total cluster usage cost by 13.9% on average compared to the baseline algorithm.
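The average bounded slowdown named above as an optimization target can be illustrated with a short sketch. It uses the definition common in the job scheduling literature, in which a job's slowdown is its turnaround time divided by its run time, the run time is floored at a threshold so that very short jobs do not dominate, and the result is never below 1. The `JobRecord` type, the function names, and the 10-second threshold are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class JobRecord:
    """Hypothetical per-job timing record (times in seconds)."""
    wait_time: float  # time between submission and start
    run_time: float   # time between start and completion


def bounded_slowdown(job: JobRecord, threshold: float = 10.0) -> float:
    """Bounded slowdown of one job.

    The threshold (assumed to be 10 s here) floors the denominator,
    and the outer max keeps the value at or above 1.
    """
    denominator = max(job.run_time, threshold)
    return max((job.wait_time + job.run_time) / denominator, 1.0)


def average_bounded_slowdown(jobs: Sequence[JobRecord],
                             threshold: float = 10.0) -> float:
    """Mean bounded slowdown over a non-empty batch of finished jobs."""
    if not jobs:
        raise ValueError("at least one job is required")
    return sum(bounded_slowdown(j, threshold) for j in jobs) / len(jobs)


if __name__ == "__main__":
    finished = [JobRecord(wait_time=30.0, run_time=60.0),
                JobRecord(wait_time=0.0, run_time=2.0)]
    print(average_bounded_slowdown(finished))
```

A scheduler that lowers this value shortens waiting relative to run time across the workload, which is why it is a natural performance term to pair with a cost term.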
Data Availability
The datasets generated during the current study are available from the corresponding author on reasonable request.
Author information
Contributions
Wenhu Shi: proposed the idea, conducted the experiments, wrote the manuscript. Hongjian Li: proposed the idea, conducted the experiments, wrote the manuscript. Hang Zeng: conducted the experiments, helped write several sections of the manuscript, proofread the manuscript.
Ethics declarations
Competing interests
None. The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
About this article
Cite this article
Shi, W., Li, H. & Zeng, H. DRL-based and Bsld-Aware Job Scheduling for Apache Spark Cluster in Hybrid Cloud Computing Environments. J Grid Computing 20, 44 (2022). https://doi.org/10.1007/s10723-022-09630-1
DOI: https://doi.org/10.1007/s10723-022-09630-1