Coping with silent and fail-stop errors at scale by combining replication and checkpointing

Research Database / FORSCHUNGSDATENBANK

JournalArticle (original work in a scientific journal)

ID	4495908
Author(s)	Benoit, Anne; Cavelan, Aurélien; Cappello, Franck; Raghavan, Padma; Robert, Yves; Sun, Hongyang
Author(s) at UniBasel	Cavelan, Aurélien; Ciorba, Florina M.
Year	2018
Title	Coping with silent and fail-stop errors at scale by combining replication and checkpointing
Journal	Journal of parallel and distributed computing
Volume	122
Pages / Article-Number	209-225
Keywords	number of processors, resilience, replication, silent errors, silent data corruptions, SDC, detection, correction, duplication, triplication, voting, optimal
Abstract	This paper provides a model and an analytical study of replication as a technique to detect and correct silent errors, as well as to cope with both silent and fail-stop errors on large-scale platforms. Fail-stop errors are immediately detected, unlike silent errors, for which a detection mechanism is required. To detect silent errors, many application-specific techniques are available, based on algorithms (ABFT), invariant preservation, or data analytics, but replication remains the most transparent and least intrusive technique. We explore the right level of replication (duplication, triplication, or more) for two frameworks: (i) when the platform is subject only to silent errors, and (ii) when the platform is subject to both silent and fail-stop errors. A higher level of replication is more expensive in terms of resource usage but makes it possible to tolerate more errors and to correct some silent errors, hence there is a trade-off to be found. Replication is combined with checkpointing and comes in two flavors: process replication and group replication. Process replication applies to message-passing applications with communicating processes. Each process is replicated, and the platform is composed of process pairs or triplets. Group replication applies to black-box applications, whose parallel execution is replicated several times. The platform is partitioned into two halves (or three thirds). In both scenarios, results are compared before each checkpoint, which is taken only when both results (duplication) or two out of three results (triplication) coincide. Otherwise, one or more silent errors have been detected and the application rolls back to the last checkpoint; it also rolls back when a fail-stop error has struck. We provide a detailed analytical study for all of these scenarios, with formulas to decide, for each scenario, the optimal parameters as a function of the error rate, checkpoint cost, and platform size. We also report a set of extensive simulation results that corroborate the analytical model.
Publisher	Elsevier
ISSN/ISBN	0743-7315
edoc-URL	https://edoc.unibas.ch/68675/
Full Text on edoc	No
Digital Object Identifier DOI	10.1016/j.jpdc.2018.08.002
ISI-Number	WOS:000448232400017
Document type (ISI)	Article
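
The compare-then-checkpoint rule described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the names `vote`, `run_with_replication`, and the `execute` callback are hypothetical, replica results are assumed to be hashable values, and only silent-error detection is modeled (a fail-stop error would trigger the same rollback but is not simulated here).

```python
from collections import Counter


def vote(results):
    """Return the value shared by at least two replicas, or None.

    With duplication (2 results) both must coincide; with triplication
    (3 results) two out of three suffice. None means a silent error was
    detected and cannot be corrected by voting.
    """
    value, count = Counter(results).most_common(1)[0]
    return value if count >= 2 else None


def run_with_replication(n_chunks, execute, initial_state=0):
    """Run n_chunks of work, checkpointing only after a successful vote.

    execute(chunk_index, state) is a hypothetical callback that runs one
    chunk on every replica, starting from the last checkpointed state,
    and returns the list of replica results.
    """
    state = initial_state  # last checkpointed state
    rollbacks = 0
    chunk = 0
    while chunk < n_chunks:
        agreed = vote(execute(chunk, state))
        if agreed is None:
            # Replicas disagree: roll back, i.e. re-execute the chunk
            # from the unchanged checkpointed state.
            rollbacks += 1
            continue
        state = agreed  # results coincide: take the checkpoint
        chunk += 1
    return state, rollbacks
```

In this sketch a single corrupted replica forces a rollback under duplication, whereas under triplication the two agreeing replicas outvote it and the checkpoint is still taken, which is the trade-off between resource usage and error tolerance that the abstract refers to.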

02/05/2024