BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20181221T160906Z
LOCATION:C145
DTSTART;TZID=America/Chicago:20181114T150000
DTEND;TZID=America/Chicago:20181114T170000
UID:submissions.supercomputing.org_SC18_sess468_spost135@linklings.com
SUMMARY:Measuring Swampiness: Quantifying Chaos in Large Heterogeneous Dat
 a Repositories
DESCRIPTION:ACM Student Research Competition, Poster\nStudent Program, Tec
 h Program Reg Pass, ACM Student Research Competition\n\nMeasuring Swampine
 ss: Quantifying Chaos in Large Heterogeneous Data Repositories\n\nJung, Wh
 itaker\n\nAs scientific data repositories and filesystems grow in size and
  complexity, they become increasingly disorganized. The coupling of massiv
 e quantities of data with poor organization makes it challenging for scien
 tists to locate and utilize relevant data, thus slowing the process of ana
 lyzing data of interest. To address these issues, we explore an automated 
 clustering approach for quantifying the organization of data repositories.
  Our parallel pipeline processes heterogeneous filetypes (e.g., text and t
 abular data), automatically clusters files based on content and metadata s
 imilarities, and computes a novel "cleanliness" score from the resulting c
 lustering. We demonstrate the generation and accuracy of our cleanliness m
 easure using both synthetic and real datasets, and conclude that it is mor
 e consistent than other potential cleanliness measures.
URL:https://sc18.supercomputing.org/presentation/?id=spost135&sess=sess468
END:VEVENT
END:VCALENDAR