HTDcr: a job execution framework for high-throughput computing on supercomputers

Jiang, Jiazhi; Huang, Dan; Chen, Hu; Lu, Yutong; Liao, Xiangke

doi:10.1007/s11432-022-3657-3

Research Paper
Published: 22 December 2023

Volume 67, article number 112104, (2024)

Science China Information Sciences

Jiazhi Jiang¹, Dan Huang¹, Hu Chen², Yutong Lu¹ & Xiangke Liao¹

Abstract

High-throughput computing (HTC) is a computing paradigm that accomplishes jobs by breaking them into smaller, independent components. However, it requires a large amount of computing power over a long period. Most existing HTC frameworks are job-oriented and support neither coscheduling with the hardware architecture nor task-level execution. In addition, most of these frameworks reach only a limited scale, and their usability needs further improvement. Herein, we present HTDcr, a job execution framework for HTC on supercomputers. This study aims to improve the throughput, task dispatching, and usability of the framework. In detail, the throughput optimizations include a sophisticated task management system, a hierarchical scheduler, and the co-optimization of the task-scheduling strategy with the application and hardware characteristics. The usability optimizations include a programmable execution workflow, mechanisms for more robust and reliable quality of service, and a fine-grained resource allocation system for the colocation of multiple jobs. According to our evaluations, HTDcr achieves outstanding scalability and high throughput on large-scale clusters for HTC workloads. We evaluate HTDcr with several microbenchmarks and real-world applications on Tianhe-2 and Sunway TaihuLight to demonstrate the effects of its design mechanisms. For instance, for two real-world applications, task scheduling integrated with the application and hardware characteristics achieves 1.7× and 1.9× speedups over the basic task-scheduling strategy.
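
The idea of hierarchical task dispatching mentioned in the abstract can be illustrated with a minimal sketch. This is not HTDcr's implementation or API: the function names (`partition`, `run_group`, `run_job`), the round-robin policy, and the two-level layout are all assumptions chosen for illustration, and the workers are simulated sequentially in a single process.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple


def partition(tasks: List[int], n_groups: int) -> List[List[int]]:
    """Top level (hypothetical): split independent tasks round-robin across groups."""
    groups: List[List[int]] = [[] for _ in range(n_groups)]
    for i, task in enumerate(tasks):
        groups[i % n_groups].append(task)
    return groups


def run_group(
    tasks: List[int], n_workers: int, fn: Callable[[int], int]
) -> Dict[int, Tuple[int, int]]:
    """Group level (hypothetical): hand queued tasks to workers in turn.

    Workers are simulated sequentially; a real framework would run them
    on separate cores or nodes.
    """
    queue = deque(tasks)
    results: Dict[int, Tuple[int, int]] = {}
    worker = 0
    while queue:
        task = queue.popleft()
        results[task] = (worker, fn(task))
        worker = (worker + 1) % n_workers
    return results


def run_job(
    tasks: List[int], n_groups: int, n_workers: int, fn: Callable[[int], int]
) -> Dict[int, Tuple[int, int, int]]:
    """Run one job: map each task to (group id, worker id, result)."""
    results: Dict[int, Tuple[int, int, int]] = {}
    for group_id, group_tasks in enumerate(partition(tasks, n_groups)):
        for task, (worker, value) in run_group(group_tasks, n_workers, fn).items():
            results[task] = (group_id, worker, value)
    return results


# Eight independent tasks, two groups of two workers each.
job = run_job(list(range(8)), n_groups=2, n_workers=2, fn=lambda t: t * t)
```

Because the tasks are independent, the top level only needs to decide which group receives each task; the order of execution within a group does not affect the results.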

Acknowledgements

This work was supported by the National Key R&D Program of China (Grant No. 2021YFB0301300), the National Natural Science Foundation of China (Grant No. U1811461), Zhejiang Lab (Grant No. 2021KC0AB04), the Major Program of Guangdong Basic and Applied Research (Grant No. 2019B030302002), and the Program for Guangdong Introducing Innovative and Entrepreneurial Teams (Grant No. 2016ZT06D211).

Author information

Authors and Affiliations

School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510006, China
Jiazhi Jiang, Dan Huang, Yutong Lu & Xiangke Liao
School of Software Engineering, South China University of Technology, Guangzhou, 510006, China
Hu Chen

Corresponding authors

Correspondence to Dan Huang or Hu Chen.

About this article

Cite this article

Jiang, J., Huang, D., Chen, H. et al. HTDcr: a job execution framework for high-throughput computing on supercomputers. Sci. China Inf. Sci. 67, 112104 (2024). https://doi.org/10.1007/s11432-022-3657-3

Received: 27 July 2022
Revised: 13 October 2022
Accepted: 06 December 2022
Published: 22 December 2023
DOI: https://doi.org/10.1007/s11432-022-3657-3
