HPC TaskMaster – Task Efficiency Monitoring System for the Supercomputer Center

Kostenetskiy, Pavel; Shamsutdinov, Artemiy; Chulkevich, Roman; Kozyrev, Vyacheslav; Antonov, Dmitriy

doi:10.1007/978-3-031-11623-0_2

Conference paper
First Online: 19 July 2022

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1618)

Abstract

This paper presents HPC TaskMaster, a monitoring system developed at HSE University for the cHARISMa cluster. The system automatically evaluates how efficiently the tasks of HPC cluster users run and identifies inefficient tasks, thereby saving expensive machine time. In addition, users can view reports on their completed tasks, with conclusions about each task's behavior and interactive graphs. The paper pays particular attention to how task efficiency is determined: the administrator can configure the evaluation criteria without changing the source code. The system is built on open-source software and is publicly available for use on other clusters.
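As an illustration of evaluation criteria that are configured rather than hard-coded, the following minimal Python sketch keeps the rules as plain data and applies them to a task's aggregated metrics. It is not the HPC TaskMaster implementation: the rule format, the metric names (`avg_cpu_percent`, `avg_gpu_percent`), the thresholds, and the functions `load_rules` and `evaluate` are all hypothetical and chosen only for this example.

```python
import operator
from dataclasses import dataclass

# Comparison operators a rule may refer to by name (hypothetical rule format).
_OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}


@dataclass(frozen=True)
class Rule:
    name: str         # label attached to a task when the rule triggers
    metric: str       # key in the task's metrics dictionary
    op: str           # one of the keys in _OPS
    threshold: float  # value the metric is compared against


def load_rules(raw):
    """Build Rule objects from plain dictionaries, e.g. parsed from a config file."""
    return [Rule(**item) for item in raw]


def evaluate(metrics, rules):
    """Return the names of all rules triggered by the given task metrics."""
    triggered = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is None:
            continue  # metric was not collected for this task; skip the rule
        if _OPS[rule.op](value, rule.threshold):
            triggered.append(rule.name)
    return triggered


# Hypothetical criteria an administrator might edit without touching the code.
RULES = load_rules([
    {"name": "low_cpu_utilization", "metric": "avg_cpu_percent", "op": "<", "threshold": 20.0},
    {"name": "idle_gpu", "metric": "avg_gpu_percent", "op": "<", "threshold": 10.0},
])

# Hypothetical aggregated metrics for one finished task.
task_metrics = {"avg_cpu_percent": 12.5, "avg_gpu_percent": 85.0}
print(evaluate(task_metrics, RULES))
```

Because the criteria are data, changing a threshold or adding a rule in this sketch means editing the list passed to `load_rules`, not the evaluation logic.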

Author information

Authors and Affiliations

HSE University, 11, Pokrovsky boulevard, Moscow, 109028, Russia
Pavel Kostenetskiy, Artemiy Shamsutdinov, Roman Chulkevich, Vyacheslav Kozyrev & Dmitriy Antonov

Corresponding author

Correspondence to Pavel Kostenetskiy.

Editor information

Editors and Affiliations

South Ural State University, Chelyabinsk, Russia
Leonid Sokolinsky
South Ural State University, Chelyabinsk, Russia
Mikhail Zymbler

Cite this paper

Kostenetskiy, P., Shamsutdinov, A., Chulkevich, R., Kozyrev, V., Antonov, D. (2022). HPC TaskMaster – Task Efficiency Monitoring System for the Supercomputer Center. In: Sokolinsky, L., Zymbler, M. (eds) Parallel Computational Technologies. PCT 2022. Communications in Computer and Information Science, vol 1618. Springer, Cham. https://doi.org/10.1007/978-3-031-11623-0_2

DOI: https://doi.org/10.1007/978-3-031-11623-0_2
Published: 19 July 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-11622-3
Online ISBN: 978-3-031-11623-0
eBook Packages: Computer Science, Computer Science (R0)
