Computer Science, Engineering and Control
Simulation of failures in high-performance computing systems under MPI-ULFM
A. A. Bondarenko, M. V. Iakobovski
Keldysh Institute of Applied Mathematics (Moscow, Russian Federation)
Abstract:
In this paper, we consider one of the main problems in high-performance computing: continuing computations despite failures. For programs running on such systems, it is very important to handle failures and continue computations on the working nodes. One of the aims of the MPI 3.1 standardization effort is to add new techniques, approaches, and concepts that support fault tolerance in MPI applications. The paper briefly describes a library for simulating failures and testing fault-tolerant algorithms using the functionality of the MPI 3.1 standard under development. In the test problem, we describe one fault tolerance technique and compare checkpointing in main memory with checkpointing in a distributed file system.
Keywords:
parallel computing, fault tolerance, checkpoint, simulation of failures, MPI, ULFM.
Received: 13.04.2015
Citation:
A. A. Bondarenko, M. V. Iakobovski, “Simulation of failures in high-performance computing systems under MPI-ULFM”, Vestn. YuUrGU. Ser. Vych. Matem. Inform., 4:3 (2015), 5–12
Linking options:
https://www.mathnet.ru/eng/vyurv1
https://www.mathnet.ru/eng/vyurv/v4/i3/p5