BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20181221T160729Z
LOCATION:C141/143/149
DTSTART;TZID=America/Chicago:20181113T113000
DTEND;TZID=America/Chicago:20181113T120000
UID:submissions.supercomputing.org_SC18_sess203_pap111@linklings.com
SUMMARY:Doomsday: Predicting Which Node Will Fail When on Supercomputers
DESCRIPTION:Paper\nGPUs, Resiliency, State of the Practice, System Softwar
 e, Tech Program Reg Pass, BSP Finalist\n\nDoomsday: Predicting Which Node 
 Will Fail When on Supercomputers\n\nDas, Mueller, Hargrove, Roman, Baden\n
 \nPredicting which node will fail and how soon remains a challenge for HPC
  resilience, yet may pave the way to exploiting proactive remedies before
  jobs fail. Distilling anomalous events from noisy raw logs requires
  substantial effort, not only when scaling up to exascale systems but
  even on contemporary supercomputer architectures. To this end, we
  propose a novel phrase extraction mechanism called TBP (time-based
  phrases) to pinpoint node failures, which is unprecedented. Our study,
  based on real system data and statistical machine learning,
  demonstrates the feasibility of predicting which specific node will
  fail in Cray systems. TBP achieves recall rates of at least 83% with
  lead times as high as 2 minutes. This opens the door to enhancing
  prediction lead times for supercomputing systems in general, thereby
  facilitating efficient usage of both computing capacity and power in
  large-scale production systems.
URL:https://sc18.supercomputing.org/presentation/?id=pap111&sess=sess203
END:VEVENT
END:VCALENDAR