Abstract
High-throughput computing (HTC) is a computing paradigm that accomplishes jobs by breaking them into smaller, independent components. However, it requires a large amount of computing power over a long period. Most existing HTC frameworks are job-oriented and lack support for task-level execution and for coscheduling with the hardware architecture. In addition, most of these frameworks reach only a limited scale, and their usability needs further improvement. Herein, we present HTDcr, a job execution framework for HTC on supercomputers. This study aims to improve the throughput, task dispatching, and usability of the framework. Specifically, the throughput optimizations include a carefully designed task management system, a hierarchical scheduler, and the co-optimization of the task-scheduling strategy with application and hardware characteristics. The usability optimizations include a programmable execution workflow, mechanisms for more robust and reliable quality of service, and a fine-grained resource allocation system for the colocation of multiple jobs. According to our evaluations, HTDcr achieves outstanding scalability and high throughput on large-scale clusters for HTC workloads. We evaluate HTDcr with several microbenchmarks and real-world applications on Tianhe-2 and Sunway TaihuLight to demonstrate the effects of its design mechanisms. For instance, for two real-world applications, task scheduling that is integrated with the application and hardware characteristics achieves 1.7× and 1.9× speedups over the basic task-scheduling strategy.
Acknowledgements
This work was supported by the National Key R&D Program of China (Grant No. 2021YFB0301300), the National Natural Science Foundation of China (Grant No. U1811461), Zhejiang Lab (Grant No. 2021KC0AB04), the Major Program of Guangdong Basic and Applied Research (Grant No. 2019B030302002), and the Program for Guangdong Introducing Innovative and Entrepreneurial Teams (Grant No. 2016ZT06D211).
Cite this article
Jiang, J., Huang, D., Chen, H. et al. HTDcr: a job execution framework for high-throughput computing on supercomputers. Sci. China Inf. Sci. 67, 112104 (2024). https://doi.org/10.1007/s11432-022-3657-3
DOI: https://doi.org/10.1007/s11432-022-3657-3