Program Systems: Theory and Applications
Program Systems: Theory and Applications, 2021, Volume 12, Issue 2, Pages 73–103
DOI: https://doi.org/10.25209/2079-3316-2021-12-2-73-103
(Mi ps383)
 

This article is cited in 1 scientific paper (total in 1 paper)

Hardware, software and distributed supercomputer systems

Monitoring applications on the ZHORES cluster at Skoltech

I. E. Zakharov, O. A. Panarin, S. G. Rykovanov, R. R. Zagidullin, A. K. Malyutin, Yu. N. Shkandybin, A. E. Ermekova

Skolkovo Institute of Science and Technology
Abstract: Standard monitoring tools for cluster computing systems assess the performance of the system as a whole but cannot analyze the performance of individual applications. A monitoring system that measures the resources requested by each application separately was developed at Skoltech for the high-performance Zhores cluster. It collects both the usual CPU and GPU utilization metrics and the CPU and GPU event counters, which allow a more detailed analysis of the resources requested by an application. Service programs deployed on each cluster node send measurements to a common time series database at one-second intervals. These data are analyzed offline to isolate the characteristics associated with each application's use of computing resources. This should reveal suboptimal applications, allow fine-tuning of the cluster functions, and improve the HPC system overall.
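The offline step described in the abstract, attributing per-node one-second measurements to individual applications, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Sample` and `Job` records, their field names, and the function `mean_cpu_per_job` are hypothetical, and the sketch assumes each job runs on a single node and that its start and end times are known from the scheduler.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Sample:
    """One measurement from a node (hypothetical record layout)."""
    node: str
    timestamp: int   # seconds; one sample per second is assumed
    cpu_util: float  # CPU utilization in percent


@dataclass(frozen=True)
class Job:
    """One application run (hypothetical; assumed to occupy a single node)."""
    job_id: str
    node: str
    start: int  # inclusive, seconds
    end: int    # exclusive, seconds


def mean_cpu_per_job(samples, jobs):
    """Average the CPU utilization samples that fall inside each job's interval.

    Jobs with no matching samples are left out of the result.
    """
    result = {}
    for job in jobs:
        values = [
            s.cpu_util
            for s in samples
            if s.node == job.node and job.start <= s.timestamp < job.end
        ]
        if values:
            result[job.job_id] = mean(values)
    return result
```

In a real deployment the samples would be read from the time series database rather than held in a list, and the same interval filter would apply to GPU metrics and event counters.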
Key words and phrases: cluster, high performance computing, application monitoring, CPU/GPU event counters, time series database.
Received: 26.01.2021
29.03.2021
Accepted: 05.06.2021
Document Type: Article
UDC: 004.451
BBC: 32.972.11
MSC: Primary 65Y05; Secondary 68M20, 68M99
Language: Russian
Citation: I. E. Zakharov, O. A. Panarin, S. G. Rykovanov, R. R. Zagidullin, A. K. Malyutin, Yu. N. Shkandybin, A. E. Ermekova, “Monitoring applications on the ZHORES cluster at Skoltech”, Program Systems: Theory and Applications, 12:2 (2021), 73–103
Citation in format AMSBIB
\Bibitem{ZakPanRyk21}
\by I.~E.~Zakharov, O.~A.~Panarin, S.~G.~Rykovanov, R.~R.~Zagidullin, A.~K.~Malyutin, Yu.~N.~Shkandybin, A.~E.~Ermekova
\paper Monitoring applications on the ZHORES cluster at Skoltech
\jour Program Systems: Theory and Applications
\yr 2021
\vol 12
\issue 2
\pages 73--103
\mathnet{http://mi.mathnet.ru/ps383}
\crossref{https://doi.org/10.25209/2079-3316-2021-12-2-73-103}
Linking options:
  • https://www.mathnet.ru/eng/ps383
  • https://www.mathnet.ru/eng/ps/v12/i2/p73