BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20181221T160904Z
LOCATION:C2/3/4 Ballroom
DTSTART;TZID=America/Chicago:20181113T083000
DTEND;TZID=America/Chicago:20181113T170000
UID:submissions.supercomputing.org_SC18_sess325_spost107@linklings.com
SUMMARY:Holistic Root Cause Analysis of Node Failures in Production HPC
DESCRIPTION:ACM Student Research Competition, Poster\nTech Program Reg Pas
 s, Exhibits Reg Pass\n\nHolistic Root Cause Analysis of Node Failures in P
 roduction HPC\n\nDas\n\nProduction HPC clusters endure failures incurring 
 computation and resource wastage. Despite the presence of various failure 
 detection and prediction schemes, a comprehensive understanding of how nod
 es fail considering various components and layers of the system is require
 d for sustained resilience. This work performs a holistic root cause diagn
 osis of node failures using a measurement-driven approach on contemporary 
 system logs that can help vendors and system administrators support exasca
 le resilience.\n\nOur work shows that lead times can be increased by at le
 ast 5 times if external subsystem correlations are considered as opposed t
 o considering the events of a specific node in isolation. Moreover, when d
 etecting sensor measurement outliers and interconnect-related failures, tr
 iggering automated recovery events can exacerbate the situation if recover
 y is unsuccessful.
URL:https://sc18.supercomputing.org/presentation/?id=spost107&sess=sess325
END:VEVENT
END:VCALENDAR

