HPC TaskMaster – Task Efficiency Monitoring System for the Supercomputer Center

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1618)

Abstract

This paper presents HPC TaskMaster, a monitoring system developed at HSE University for the cHARISMa cluster. The system automatically evaluates how efficiently the tasks of HPC cluster users run and identifies inefficient tasks, thereby saving a significant amount of expensive machine time. In addition, users can view reports on their completed tasks, together with conclusions about task performance and interactive graphs. The paper pays particular attention to how task efficiency is determined: the administrator can configure the evaluation criteria without changing the source code. The system is built with open-source software and is publicly available for use on other clusters.
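
The abstract does not say how the configurable criteria are expressed, so the following is only a minimal sketch of the general idea, not the HPC TaskMaster implementation: evaluation rules are kept as plain data and applied to a task's metrics, so changing a criterion means editing configuration rather than code. The `Rule` fields, the metric names (`cpu_util_avg`, `gpu_util_avg`), and the thresholds are all hypothetical.

```python
import operator
from dataclasses import dataclass

# Hypothetical set of comparison operators a rule may name in its configuration.
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}


@dataclass(frozen=True)
class Rule:
    name: str         # label shown in a report
    metric: str       # key in the per-task metrics dict
    op: str           # one of the keys of OPS
    threshold: float  # value the metric is compared against


def load_rules(config):
    """Build rules from plain data, e.g. parsed from a JSON or YAML file."""
    return [Rule(**entry) for entry in config]


def inefficiency_flags(metrics, rules):
    """Return the names of all rules triggered by the task's metrics."""
    return [
        r.name
        for r in rules
        if r.metric in metrics and OPS[r.op](metrics[r.metric], r.threshold)
    ]


# Illustrative configuration; metric names and thresholds are made up.
CONFIG = [
    {"name": "low CPU utilization", "metric": "cpu_util_avg", "op": "<", "threshold": 20.0},
    {"name": "idle GPU", "metric": "gpu_util_avg", "op": "<", "threshold": 5.0},
]

rules = load_rules(CONFIG)
flags = inefficiency_flags({"cpu_util_avg": 12.5, "gpu_util_avg": 80.0}, rules)
print(flags)
```

In this sketch an administrator would adjust or add entries in `CONFIG` to change what counts as an inefficient task; rules whose metric is missing for a given task are simply skipped.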

Corresponding author

Correspondence to Pavel Kostenetskiy.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Kostenetskiy, P., Shamsutdinov, A., Chulkevich, R., Kozyrev, V., Antonov, D. (2022). HPC TaskMaster – Task Efficiency Monitoring System for the Supercomputer Center. In: Sokolinsky, L., Zymbler, M. (eds) Parallel Computational Technologies. PCT 2022. Communications in Computer and Information Science, vol 1618. Springer, Cham. https://doi.org/10.1007/978-3-031-11623-0_2

  • DOI: https://doi.org/10.1007/978-3-031-11623-0_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-11622-3

  • Online ISBN: 978-3-031-11623-0

  • eBook Packages: Computer Science (R0)
