Hardware, software and distributed supercomputer systems
Monitoring applications on the ZHORES cluster at Skoltech
I. E. Zakharov, O. A. Panarin, S. G. Rykovanov, R. R. Zagidullin, A. K. Malyutin, Yu. N. Shkandybin, A. E. Ermekova
Skolkovo Institute of Science and Technology
Abstract:
Standard monitoring tools for cluster computing systems assess the
performance of the system as a whole but cannot analyze the performance of
individual applications. A monitoring system that measures the resources
requested by each application separately was developed at Skoltech for the
high-performance Zhores cluster. The system collects both the usual CPU and
GPU utilization metrics and the CPU and GPU event counters, which allow a more
detailed analysis of the resources requested by an application. Service
programs deployed on each cluster node send measurements to a common time
series database at one-second intervals. These data are analyzed offline to
isolate the characteristics of computing-resource use by each application.
This analysis should reveal suboptimal applications, allow fine-tuning of the
cluster functions, and improve the HPC system overall.
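The offline step described above, attributing node-level measurements to individual applications, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record layout (`Sample`, `Job`), the field names, and the function `mean_cpu_per_job` are hypothetical, and the sketch assumes that each job's node list and start and end times are available, for example from a scheduler log.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Sample:
    """One measurement as it might be stored in the time series database (assumed schema)."""
    timestamp: int    # seconds since some epoch; one sample per node per second
    node: str         # hypothetical node name
    cpu_util: float   # CPU utilization in percent


@dataclass(frozen=True)
class Job:
    """One application run (assumed to come from scheduler records)."""
    job_id: str
    nodes: FrozenSet[str]
    start: int        # first second of the run, inclusive
    end: int          # end of the run, exclusive


def mean_cpu_per_job(samples: Iterable[Sample],
                     jobs: Iterable[Job]) -> Dict[str, Optional[float]]:
    """Average CPU utilization of the samples that fall inside each job.

    A sample belongs to a job if it was taken on one of the job's nodes
    within the job's [start, end) interval. Jobs with no matching samples
    map to None.
    """
    samples = list(samples)  # allow several passes over the data
    result: Dict[str, Optional[float]] = {}
    for job in jobs:
        values = [s.cpu_util for s in samples
                  if s.node in job.nodes and job.start <= s.timestamp < job.end]
        result[job.job_id] = mean(values) if values else None
    return result


if __name__ == "__main__":
    demo_samples = [Sample(0, "n1", 90.0), Sample(1, "n1", 80.0),
                    Sample(2, "n1", 10.0), Sample(0, "n2", 50.0)]
    demo_jobs = [Job("a", frozenset({"n1"}), 0, 2)]
    print(mean_cpu_per_job(demo_samples, demo_jobs))
```

The same join on node and time interval would apply to any other metric or event counter. The sketch assumes that each node runs at most one job at a time, so a node-level sample can be attributed to a single application.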
Key words and phrases:
cluster, high performance computing, application monitoring, CPU/GPU event counters, time series database.
Received: 26.01.2021 29.03.2021 Accepted: 05.06.2021
Citation:
I. E. Zakharov, O. A. Panarin, S. G. Rykovanov, R. R. Zagidullin, A. K. Malyutin, Yu. N. Shkandybin, A. E. Ermekova, “Monitoring applications on the ZHORES cluster at Skoltech”, Program Systems: Theory and Applications, 12:2 (2021), 73–103
Linking options:
https://www.mathnet.ru/eng/ps383 https://www.mathnet.ru/eng/ps/v12/i2/p73