Vestnik Yuzhno-Ural'skogo Gosudarstvennogo Universiteta. Seriya "Vychislitelnaya Matematika i Informatika"
Vestnik Yuzhno-Ural'skogo Gosudarstvennogo Universiteta. Seriya "Vychislitelnaya Matematika i Informatika", 2014, Volume 3, Issue 3, Pages 20–36 (Mi vyurv46)  

This article is cited in 2 scientific papers (total in 2 papers)

Computer Science, Engineering and Control

Fault tolerance for HPC by using local checkpoints

A. A. Bondarenko, M. V. Iakobovski

Keldysh Institute of Applied Mathematics (Moscow, Russian Federation)
Abstract: One of the main problems in high-performance computing is continuing a computation despite failures. In this paper, we consider the main definitions relating to dependability, briefly review failure rates for distributed systems, and survey rollback-recovery approaches. The classic fault-tolerance technique used in parallel applications is the coordinated checkpointing protocol. This protocol takes a consistent global checkpoint by capturing the local state of each process simultaneously and saving it to a parallel file system via I/O nodes. However, as the number of compute nodes increases and the size of applications grows, the performance overhead of this protocol can reach an unacceptable level. A solution to this problem is to use local storage for checkpointing. To protect the checkpoints, they must be duplicated to other local storage devices. In this work, we develop a user-level approach and present a scheme for checkpointing to local storage. We prove that if the number of failures is less than the maximum allowable value for the scheme, then it is possible to recover from a consistent global checkpoint.
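The idea of duplicating local checkpoints can be illustrated with a minimal sketch. The abstract does not describe the placement scheme used in the paper, so the ring layout below is an assumption made only for illustration: each node keeps its own checkpoint and copies it to its next `n_copies` neighbours. The names `holders`, `recoverable`, and `max_tolerated_failures` are hypothetical. The sketch also assumes that all stored checkpoints belong to the same coordinated (and therefore consistent) global checkpoint, so recovery succeeds whenever every process's checkpoint survives on at least one working node.

```python
from itertools import combinations


def holders(rank: int, n_nodes: int, n_copies: int) -> set[int]:
    """Nodes storing the checkpoint of `rank` under the assumed ring layout:
    the node itself plus its next `n_copies` neighbours."""
    return {(rank + j) % n_nodes for j in range(n_copies + 1)}


def recoverable(failed: set[int], n_nodes: int, n_copies: int) -> bool:
    """True if every rank's checkpoint survives on at least one non-failed node."""
    return all(holders(r, n_nodes, n_copies) - failed for r in range(n_nodes))


def max_tolerated_failures(n_nodes: int, n_copies: int) -> int:
    """Largest f such that every failure set of size f is recoverable.
    Brute force over all failure sets, so only suitable for small n_nodes."""
    f = 0
    while f < n_nodes and all(
        recoverable(set(c), n_nodes, n_copies)
        for c in combinations(range(n_nodes), f + 1)
    ):
        f += 1
    return f


if __name__ == "__main__":
    # Hypothetical example: 6 nodes, each checkpoint copied to 2 neighbours.
    print(max_tolerated_failures(6, 2))
```

In this hypothetical layout, losing `n_copies + 1` adjacent nodes destroys every copy of one checkpoint, which is why the tolerated number of failures is bounded by the number of copies.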
Keywords: parallel computing, fault tolerance, checkpoint, MPI.
Funding: Russian Foundation for Basic Research, grant number 13-01-12073 офи_м
Received: 05.08.2014
Document Type: Article
UDC: 004.052.3
Language: Russian
Citation: A. A. Bondarenko, M. V. Iakobovski, “Fault tolerance for HPC by using local checkpoints”, Vestn. YuUrGU. Ser. Vych. Matem. Inform., 3:3 (2014), 20–36
Citation in format AMSBIB
\Bibitem{BonIak14}
\by A.~A.~Bondarenko, M.~V.~Iakobovski
\paper Fault tolerance for HPC by using local checkpoints
\jour Vestn. YuUrGU. Ser. Vych. Matem. Inform.
\yr 2014
\vol 3
\issue 3
\pages 20--36
\mathnet{http://mi.mathnet.ru/vyurv46}
Linking options:
  • https://www.mathnet.ru/eng/vyurv46
  • https://www.mathnet.ru/eng/vyurv/v3/i3/p20