BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260522T150123Z
LOCATION:D221
DTSTART;TZID=America/Chicago:20181111T161000
DTEND;TZID=America/Chicago:20181111T163000
UID:submissions.supercomputing.org_SC18_sess162_ws_cre102@linklings.com
SUMMARY:Reproducibility for Streaming Analysis
DESCRIPTION:Christopher J. Wright (Columbia University), Line Pouchard (Br
 ookhaven National Laboratory), and Simon J. L. Billinge (Columbia Universi
 ty)\n\nThe natural and physical sciences increasingly need streaming data 
 processing for live data analysis and autonomous experimentation. Furtherm
 ore, data provenance and replicability are important to assure the veracit
 y of scientific results. Here we describe a software system that combines 
 high performance computing, streaming data processing, and automatic data 
 provenance capturing to address this need. Data provenance and streaming d
 ata processing share a common data structure, the directed acyclic graph (
 DAG), which describes the order of each computational step. Data processin
 g requires the DAG to specify what computations to run in what order, and 
 the execution can be recreated from the graph, reproducing the analyzed da
 ta and capturing provenance. In our framework the description and ordering
  of the analysis steps (the pipeline) are separated from their execution (
 the streaming analysis) and the DAG created for the streaming data process
 ing is captured during data analysis. Streaming data can have high through
 puts and our system allows users to choose among multiple parallel process
 ing backends, including Dask. To guarantee reproducibility, unique links t
 o the incoming data and their timestamps are captured alongside the DAG. A
 nalyzed data, along with provenance metadata, are stored in a database, fr
 om which the analysis can be re-run from raw data, enabling verification o
 f results, exploration of how parameters change outcomes, and reuse of dat
 a processing. This system is running in production at the National Synchro
 tron Light Source-I
 I (NSLS-II) x-ray powder diffraction beamlines.\n\nTag: Exascale, Hot Topi
 cs, Reproducibility, Scientific Computing\n\nRegistration Category: Worksh
 op Reg Pass\n\nSession Chairs: Walid Keyrouz (National Institute of Standa
 rds and Technology (NIST)) and Michael V. Mascagni (Florida State Universi
 ty, National Institute of Standards and Technology (NIST))\n\n
END:VEVENT
END:VCALENDAR
