BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260522T150110Z
LOCATION:C2/3/4 Ballroom
DTSTART;TZID=America/Chicago:20181114T083000
DTEND;TZID=America/Chicago:20181114T170000
UID:submissions.supercomputing.org_SC18_sess323_post155@linklings.com
SUMMARY:Tensorfolding: Improving Convolutional Neural Network Performance 
 with Fused Microkernels
DESCRIPTION:Michael Anderson\, Evangelos Georganas\, Sasikanth Avancha\,
  and Alexander Heinecke (Intel Corporation)\n\nConvolution layers are
  prevalent in many classes of deep neural networks\, including
  Convolutional Neural Networks (CNNs)\, which provide state-of-the-art
  results for tasks like image recognition\, neural machine translation
  and speech recognition. In the recent past\, several techniques to
  improve generalization capabilities of neural networks have been
  developed\; the most prominent and successful is batch normalization.
  In deep neural network training\, the batch normalization layer
  consists of a memory-bandwidth-bound kernel. On the latest Intel
  Skylake-based Xeon processors\, a significant portion of execution
  time is spent in this kernel. By leveraging the CPU's large caches
  and its latency-optimized execution model\, we are able to reduce
  this kernel's time to a bare minimum while improving forward pass
  layer runtimes by 21% compared to an unfused implementation and by 2%
  compared to a fused implementation.\n\nRegistration Category: Tech
  Program Reg Pass\, Exhibits Reg Pass\n\n
END:VEVENT
END:VCALENDAR