A. A. Bondarenko, M. V. Iakobovski, “Fault tolerance for HPC by using local checkpoints”, Vestn. YuUrGU. Ser. Vych. Matem. Inform., 3:3 (2014), 20

Vestnik Yuzhno-Ural'skogo Gosudarstvennogo Universiteta. Seriya "Vychislitelnaya Matematika i Informatika"

Vestnik Yuzhno-Ural'skogo Gosudarstvennogo Universiteta. Seriya "Vychislitelnaya Matematika i Informatika", 2014, Volume 3, Issue 3, Pages 20–36 (Mi vyurv46)

This article is cited in 2 scientific papers (total in 2 papers)

Computer Science, Engineering and Control

Fault tolerance for HPC by using local checkpoints

A. A. Bondarenko, M. V. Iakobovski

Keldysh Institute of Applied Mathematics (Moscow, Russian Federation)

Abstract: One of the main problems in high-performance computing is continuing computations despite failures. In this paper, we consider the main definitions relating to dependability, briefly review failure rates for distributed systems, and survey rollback-recovery approaches. The classic fault-tolerance technique used in parallel applications is the coordinated checkpointing protocol. This protocol takes a consistent global checkpoint by capturing the local state of each process node simultaneously and saving it on a parallel file system via I/O nodes. However, as the number of compute nodes increases and the size of applications grows, the performance overhead of this protocol can reach an unacceptable level. A solution to this problem is to use local storage for checkpointing. To provide protection, checkpoints must be duplicated to other local storage devices. In this work, we develop a user-level approach and present a scheme for checkpointing to local storage. We prove that, if the number of failures is less than the maximum allowable value for the scheme, then it is possible to recover from a consistent global checkpoint.
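
The idea of duplicating local checkpoints can be illustrated with a minimal sketch. The abstract does not give the placement rule used in the paper, so the code below assumes a simple ring placement: each process keeps its checkpoint on its own node and copies it to the next nodes on a ring. The names `replica_holders`, `recoverable`, `recovery_plan`, and the parameter `n_copies` are hypothetical and chosen only for illustration. Under this assumed placement, any set of fewer than `n_copies` failed nodes leaves at least one surviving copy of every checkpoint.

```python
from typing import Dict, List, Set


def replica_holders(rank: int, n_nodes: int, n_copies: int) -> List[int]:
    """Nodes holding the checkpoint of `rank`: its own node plus the next
    n_copies - 1 nodes on a ring (assumed placement, not the paper's)."""
    return [(rank + i) % n_nodes for i in range(n_copies)]


def recoverable(n_nodes: int, n_copies: int, failed: Set[int]) -> bool:
    """True if every process's checkpoint survives on at least one live node."""
    for rank in range(n_nodes):
        holders = replica_holders(rank, n_nodes, n_copies)
        if all(h in failed for h in holders):
            return False
    return True


def recovery_plan(n_nodes: int, n_copies: int, failed: Set[int]) -> Dict[int, int]:
    """Map each rank to the first surviving node that holds its checkpoint."""
    plan: Dict[int, int] = {}
    for rank in range(n_nodes):
        live = [h for h in replica_holders(rank, n_nodes, n_copies)
                if h not in failed]
        if not live:
            raise RuntimeError(f"all copies of checkpoint {rank} are lost")
        plan[rank] = live[0]
    return plan


if __name__ == "__main__":
    # 8 nodes, 3 copies per checkpoint, nodes 0 and 1 have failed.
    print(recoverable(8, 3, {0, 1}))
    print(recovery_plan(8, 3, {0, 1}))
```

In this sketch the maximum allowable number of failures is `n_copies - 1`; three adjacent failures on the ring (for example nodes 0, 1, and 2 with `n_copies = 3`) destroy every copy of one checkpoint, and recovery from the global checkpoint is no longer possible.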

Keywords: parallel computing, fault tolerance, checkpoint, MPI.

Funding: Russian Foundation for Basic Research, grant number 13-01-12073 ofi_m

Received: 05.08.2014

Document Type: Article

UDC: 004.052.3

Language: Russian

Citation: A. A. Bondarenko, M. V. Iakobovski, “Fault tolerance for HPC by using local checkpoints”, Vestn. YuUrGU. Ser. Vych. Matem. Inform., 3:3 (2014), 20–36

Citation in format AMSBIB

\Bibitem{BonIak14}
\by A.~A.~Bondarenko, M.~V.~Iakobovski
\paper Fault tolerance for HPC by using local checkpoints
\jour Vestn. YuUrGU. Ser. Vych. Matem. Inform.
\yr 2014
\vol 3
\issue 3
\pages 20--36
\mathnet{http://mi.mathnet.ru/vyurv46}

Linking options:

https://www.mathnet.ru/eng/vyurv46

https://www.mathnet.ru/eng/vyurv/v3/i3/p20
