HTDcr: a job execution framework for high-throughput computing on supercomputers

  • Research Paper
  • Science China Information Sciences

Abstract

High-throughput computing (HTC) is a computing paradigm for jobs that can easily be broken into smaller, independent components. However, it requires a large amount of computing power over a long period. Most existing HTC frameworks are job-oriented and support neither coscheduling with the hardware architecture nor task-level execution. Moreover, most of these frameworks reach only a limited scale, and their usability needs further improvement. Herein, we present HTDcr, a job execution framework for HTC on supercomputers. This study aims to improve the throughput, task dispatching, and usability of the framework. In detail, the throughput optimizations include a carefully designed task management system, a hierarchical scheduler, and the co-optimization of the task-scheduling strategy with the application and hardware characteristics. The usability optimizations include a programmable execution workflow, mechanisms for more robust and reliable quality of service, and a fine-grained resource allocation system for the colocation of multiple jobs. According to our evaluations, HTDcr achieves outstanding scalability and high throughput on large-scale clusters for HTC workloads. We evaluate HTDcr with several microbenchmarks and real-world applications on Tianhe-2 and Sunway TaihuLight to demonstrate the effects of its design mechanisms. For instance, for two real-world applications, task scheduling integrated with the application and hardware characteristics achieves 1.7× and 1.9× speedups over the basic task-scheduling strategy.

Acknowledgements

This work was supported by the National Key R&D Program of China (Grant No. 2021YFB0301300), the National Natural Science Foundation of China (Grant No. U1811461), Zhejiang Lab (Grant No. 2021KC0AB04), the Major Program of Guangdong Basic and Applied Research (Grant No. 2019B030302002), and the Program for Guangdong Introducing Innovative and Entrepreneurial Teams (Grant No. 2016ZT06D211).

Author information

Corresponding authors

Correspondence to Dan Huang or Hu Chen.

About this article

Cite this article

Jiang, J., Huang, D., Chen, H. et al. HTDcr: a job execution framework for high-throughput computing on supercomputers. Sci. China Inf. Sci. 67, 112104 (2024). https://doi.org/10.1007/s11432-022-3657-3

  • DOI: https://doi.org/10.1007/s11432-022-3657-3
