Vestnik Yuzhno-Ural'skogo Gosudarstvennogo Universiteta. Seriya "Vychislitelnaya Matematika i Informatika", 2014, Volume 3, Issue 3, Pages 20–36
(Mi vyurv46)
This article is cited in 2 scientific papers (total in 2 papers)
Computer Science, Engineering and Control
Fault tolerance for HPC by using local checkpoints
A. A. Bondarenko, M. V. Iakobovski
Keldysh Institute of Applied Mathematics (Moscow, Russian Federation)
Abstract:
One of the main problems in high-performance computing is continuing a computation despite failures. In this paper, we consider the main definitions relating to dependability, briefly review failure rates for distributed systems, and survey rollback-recovery approaches. The classic fault-tolerance technique used in parallel applications is the coordinated checkpointing protocol. This protocol takes a consistent global checkpoint by capturing the local state of each process node simultaneously and saving it to a parallel file system via I/O nodes. However, as the number of compute nodes increases and the size of applications grows, the performance overhead of this protocol can reach an unacceptable level. One solution to this problem is to use local storage for checkpointing. To protect the checkpoints against node failures, they must be duplicated to other local storage devices. In this work, we develop a user-level approach and present a scheme for checkpointing to local storage. We prove that if the number of failures is less than the maximum allowable value for the scheme, then recovery from a consistent global checkpoint is possible.
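The idea of duplicating local checkpoints can be illustrated with a small sketch. The abstract does not specify the placement scheme, so the code below assumes a simple ring placement in which each node keeps its own checkpoint and copies it to its next `k` neighbours; the function names `replica_holders` and `recoverable` are hypothetical and are not taken from the paper. Under this assumption, a consistent global checkpoint survives whenever at least one holder of every node's checkpoint is still alive, which is guaranteed for any set of at most `k` failures.

```python
from itertools import combinations


def replica_holders(rank, n_nodes, k):
    """Nodes holding a copy of `rank`'s checkpoint (assumed ring placement):
    the node itself plus its next k neighbours on the ring."""
    return [(rank + i) % n_nodes for i in range(k + 1)]


def recoverable(failed, n_nodes, k):
    """True if every node's checkpoint still has at least one surviving copy."""
    failed = set(failed)
    return all(
        any(holder not in failed for holder in replica_holders(rank, n_nodes, k))
        for rank in range(n_nodes)
    )


def tolerates_all(n_nodes, k, n_failures):
    """Exhaustively check every failure set of the given size."""
    return all(
        recoverable(failed, n_nodes, k)
        for failed in combinations(range(n_nodes), n_failures)
    )
```

For example, with 8 nodes and one extra copy per checkpoint (`k = 1`), `recoverable({3, 5}, 8, 1)` holds because every checkpoint has a surviving holder, while `recoverable({3, 4}, 8, 1)` does not, since both copies of node 3's checkpoint are lost. The exhaustive check `tolerates_all(n_nodes, k, k)` confirms, for small configurations, that this assumed placement tolerates any `k` simultaneous failures.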
Keywords:
parallel computing, fault tolerance, checkpoint, MPI.
Received: 05.08.2014
Citation:
A. A. Bondarenko, M. V. Iakobovski, “Fault tolerance for HPC by using local checkpoints”, Vestn. YuUrGU. Ser. Vych. Matem. Inform., 3:3 (2014), 20–36
Linking options:
https://www.mathnet.ru/eng/vyurv46 https://www.mathnet.ru/eng/vyurv/v3/i3/p20