Analysis of Node Failures in High Performance Computers Based on System Logs

Research Database / FORSCHUNGSDATENBANK

Publication 3386311 | Verified

Other Publications (research reports or similar)

ID	3386311
Author(s)	Ghiasvand, Siavash; Ciorba, Florina M.; Tschüter, Ronny; Nagel, Wolfgang E.
Author(s) at UniBasel	Ciorba, Florina M.
Year	2015
Title	Analysis of Node Failures in High Performance Computers Based on System Logs
Journal/Series title	28th ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2015)
Publication Type	misc
URL	http://sc15.supercomputing.org/sites/all/themes/SC15images/tech_poster/tech_poster_pages/post338.html
Keywords	2015
Abstract	The growth in size and complexity of HPC systems leads to a rapid increase in their failure rates. In the near future, it is expected that the mean time between failures of HPC systems will become too short and that current failure recovery mechanisms will no longer be able to recover the systems from failures. Early failure detection is thus essential to prevent their destructive effects. Based on measurements of a production system at TU Dresden over an 8-month period, we study the correlation of node failures in time and space. We infer possible types of correlations and show that in many cases the observed node failures are directly correlated. The significance of such a study lies in achieving a clearer understanding of correlations between observed node failures and in enabling failure detection as early as possible. The results aim to help system administrators minimize (or prevent) the destructive effects of failures.
edoc-URL	http://edoc.unibas.ch/40834/
Full Text on edoc	Available
