BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20181221T160731Z
LOCATION:D174
DTSTART;TZID=America/Chicago:20181116T084000
DTEND;TZID=America/Chicago:20181116T090000
UID:submissions.supercomputing.org_SC18_sess146_ws_ftxs101@linklings.com
SUMMARY:Toward Ad Hoc Recovery For Soft Errors
DESCRIPTION:Workshop\nResiliency, Scientific Computing, Workshop Reg Pass\
 n\nToward Ad Hoc Recovery For Soft Errors\n\nLosada, Bautista-Gomez, Kelle
 r, Unsal\n\nThe coming exascale era is a great opportunity for high perfor
 mance computing (HPC) applications. However, high failure rates on these s
 ystems will jeopardize the successful completion of their execution.
  Bit-flip errors in dynamic random access memory (DRAM) account for a
  noticeable sha
 re of the failures in supercomputers. Hardware mechanisms, such as error c
 orrecting code (ECC), can detect and correct single-bit errors and can det
 ect some multi-bit errors while others can go undiscovered. Unfortunately,
  detected multi-bit errors will most of the time force the termination of 
 the application and lead to a global restart. Thus, other strategies at th
 e software level are needed to tolerate these types of faults more
  efficiently and to avoid a global restart. In this work, we extend the
  FTI checkpo
 inting library to facilitate the implementation of custom recovery strateg
 ies for MPI applications, minimizing the overhead introduced when coping w
 ith soft errors. The new functionalities are evaluated by implementing loc
 al forward recovery on three HPC benchmarks with different reliability req
 uirements. Our results demonstrate a reduction in recovery times of up to
  14%.
URL:https://sc18.supercomputing.org/presentation/?id=ws_ftxs101&sess=sess1
 46
END:VEVENT
END:VCALENDAR

