Vestnik Yuzhno-Ural'skogo Gosudarstvennogo Universiteta. Seriya "Vychislitelnaya Matematika i Informatika"
Vestnik Yuzhno-Ural'skogo Gosudarstvennogo Universiteta. Seriya "Vychislitelnaya Matematika i Informatika", 2019, Volume 8, Issue 2, Pages 76–91
DOI: https://doi.org/10.14529/cmse190205
(Mi vyurv213)
 

Coordinated checkpointing with sender-based logging and asynchronous recovery from failure

A. A. Bondarenko, P. A. Lyakhov, M. V. Yakobovskiy

Keldysh Institute of Applied Mathematics, Russian Academy of Sciences (Miusskaya sq. 4, Moscow, 125047 Russia)
Abstract: The steady growth in the number of supercomputer components leads HPC specialists to unfavorable estimates for future systems: "the range of the mean time between failures will be from 1 hour to 9 hours". Such failure rates make long-running computations on supercomputers problematic. In this paper, we propose a method of recovery from failure that does not require all processes to roll back, which can reduce overhead for some computational algorithms. The standard fault tolerance method consists of two phases: coordinated checkpointing and, in the case of a failure, rollback of all processes to the last checkpoint. The proposed method combines coordinated checkpointing with sender-based logging and asynchronous recovery, in which most processes wait while several processes recompute the lost data. We developed parallel programs that solve the heat transfer problem in a thin plate, a computation whose algorithm requires only a small amount of data to be logged. In these programs, failures are injected by calling raise(SIGKILL), and coordinated or asynchronous recovery is performed using ULFM functions. To obtain theoretical estimates of the overhead, we propose a simulation model of program execution with failures; the model assumes that failures can strike during computation, checkpointing, and recovery. We compared the recovery methods at different failure rates. The comparison showed that asynchronous recovery reduces overhead by 22% to 40% according to the theoretical estimates and by 13% to 53% in the computational experiments.
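The abstract does not give the details of the authors' simulation model, so the following is only a minimal illustrative sketch of the stated assumption that failures can strike during computation, checkpointing, and recovery. The function name `simulate_run`, all of its parameters, and the choice of exponentially distributed failure times are assumptions made for this example and are not taken from the paper. It does not reproduce the authors' model or their numbers.

```python
import random


def simulate_run(work, interval, ckpt_cost, recovery_cost, mtbf, rng):
    """Return the wall-clock time needed to finish `work` units of computation.

    Hypothetical model (not the authors'): the program alternates a
    computation segment of length `interval` with a checkpoint of length
    `ckpt_cost`. Failure times are exponential with mean `mtbf`. A failure
    during a segment or its checkpoint discards that segment; recovery takes
    `recovery_cost` and may itself be interrupted by further failures.
    """
    clock = 0.0
    saved = 0.0  # work already protected by a completed checkpoint
    next_failure = rng.expovariate(1.0 / mtbf)
    while saved < work:
        segment = min(interval, work - saved)
        needed = segment + ckpt_cost
        if clock + needed <= next_failure:
            # Segment and checkpoint complete before the next failure.
            clock += needed
            saved += segment
            continue
        # A failure strikes during computation or checkpointing.
        clock = next_failure
        next_failure = clock + rng.expovariate(1.0 / mtbf)
        # A failure during recovery forces the recovery to restart.
        while clock + recovery_cost > next_failure:
            clock = next_failure
            next_failure = clock + rng.expovariate(1.0 / mtbf)
        clock += recovery_cost
    return clock


def overhead(total_time, work):
    """Relative overhead: extra wall-clock time as a fraction of useful work."""
    return total_time / work - 1.0
```

Under these assumptions, two recovery methods could be compared by running the simulation many times with different `recovery_cost` values (and, if desired, different amounts of recomputed work) and averaging `overhead` for each failure rate.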
Keywords: MPI, ULFM extension, coordinated checkpointing, asynchronous recovery, fault tolerance.
Funding: Russian Foundation for Basic Research, grant no. 17-07-01604 а
Received: 20.11.2018
Document Type: Article
UDC: 004.052.3
Language: Russian
Citation: A. A. Bondarenko, P. A. Lyakhov, M. V. Yakobovskiy, “Coordinated checkpointing with sender-based logging and asynchronous recovery from failure”, Vestn. YuUrGU. Ser. Vych. Matem. Inform., 8:2 (2019), 76–91
Citation in format AMSBIB
\Bibitem{BonLyaIak19}
\by A.~A.~Bondarenko, P.~A.~Lyakhov, M.~V.~Yakobovskiy
\paper Coordinated checkpointing with sender-based logging and asynchronous recovery from failure
\jour Vestn. YuUrGU. Ser. Vych. Matem. Inform.
\yr 2019
\vol 8
\issue 2
\pages 76--91
\mathnet{http://mi.mathnet.ru/vyurv213}
\crossref{https://doi.org/10.14529/cmse190205}
\elib{https://elibrary.ru/item.asp?id=38073495}
Linking options:
  • https://www.mathnet.ru/eng/vyurv213
  • https://www.mathnet.ru/eng/vyurv/v8/i2/p76