BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20181221T160905Z
LOCATION:C2/3/4 Ballroom
DTSTART;TZID=America/Chicago:20181114T083000
DTEND;TZID=America/Chicago:20181114T170000
UID:submissions.supercomputing.org_SC18_sess323@linklings.com
SUMMARY:Research Posters
DESCRIPTION:Poster\nTech Program Reg Pass, Exhibits Reg Pass\n\nExploring 
 Application Performance on Fat-Tree Networks in the Presence of Congestion
 \n\nTaffet, Rao, Karlin\n\nNetwork congestion, which occurs when multiple 
 applications simultaneously use shared links in a cluster network, can
  cause poor communication performance, decreasing the performance and
  scalabilit
 y of parallel applications. Many studies are performed while clusters also
  run other production workloads...\n\n---------------------\nMulti-Client 
 DeepIO for Large-Scale Deep Learning on HPC Systems\n\nZhu, Chowdhury, Fu,
  Moody, Mohror...\n\nWith the growth of computation power, leadership High
 -Performance Computing (HPC) systems can train larger datasets for deep ne
 ural networks (DNNs) more efficiently. On HPC systems, a training dataset 
 is on a parallel file system or node-local storage devices. However, not a
 ll HPC clusters have node...\n\n---------------------\nEnergy Efficiency o
 f Reconfigurable Caches on FPGAs\n\nWang, Li, Geng, Herbordt\n\nThe perfor
 mance of a given cache architecture depends largely on the applications th
 at run on it. Even though each application has its best-suited cache confi
 guration, vendors of fixed HPC systems must provide compromise designs. Re
 configurable caches can adjust cache configuration dynamically to ge...\n\
 n---------------------\nRGB (Redfish Green500 Benchmarker): A Green500 Ben
 chmarking Tool Using Redfish\n\nHojati, Chen, Sill, Hass\n\nPerformance an
 d energy are important factors for supercomputers and data-centers with a 
 trade-off between them. The energy-efficiency metric considers both of th
 ese properties. The Green500 is a branch of the Top500 project, which prov
 ides a li
 st of supercomputers based on energy efficiency. It has a manual...\n\n---
 ------------------\nOptimization of Ultrasound Simulations on Multi-GPU Se
 rvers\n\nVaverka, Spetko, Treeby, Jaros\n\nRealistic ultrasound simulation
 s have found a broad range of applications in preoperative photoacoustic s
 creening and non-invasive ultrasound treatment planning. However, the dom
 ain
 s are typically thousands of wavelengths in size, leading to large-scale n
 umerical models with billions of unknowns. The ...\n\n--------------------
 -\nGPGPU Performance Estimation with Core and Memory Frequency Scaling\n\n
 Wang, Chu\n\nGraphics processing units (GPUs) support dynamic voltage and 
 frequency scaling to balance computational performance and energy consumpt
 ion. However, simple and accurate performance estimation for a given GPU k
 ernel under different frequency settings is still lacking for real hardwar
 e, which is impor...\n\n---------------------\nMaking Sense of Scientific 
 Simulation Ensembles\n\nDahshan, Polys\n\nScientists run many simulations 
 with varying initial conditions, known as "ensembles", to understand the i
 nfluence and relationships among multiple parameters or ensemble members. 
 Most of the ensemble visualization and analysis approaches and techniques 
 focus on analyzing the relationships between e...\n\n---------------------
 \nWhich Architecture Is Better Suited for Matrix-Free Finite-Element Algor
 ithms: Intel Skylake or Nvidia Volta?\n\nKronbichler, Allalen, Ohlerich, W
 all\n\nThis work presents a performance comparison of highly tuned matrix-
 free finite element kernels from the finite element library on different c
 ontemporary computer architectures, NVIDIA V100 and P100 GPUs, an Intel Kn
 ights Landing Xeon Phi, and two multi-core Intel CPUs (Broadwell and Skyla
 ke).  The a...\n\n---------------------\nSpotSDC: an Information Visualiza
 tion System to Analyze Silent Data Corruption\n\nLi, Menon, Livnat, Mohror
 , Pascucci\n\nAggressive technology scaling trends are expected to make th
 e hardware of HPC systems more susceptible to transient faults. Transient 
 faults in hardware may be masked without affecting the program output, cau
 se a program to crash, or lead to silent data corruptions (SDC). While fau
 lt injection studi...\n\n---------------------\nHigh-Accuracy Scalable Sol
 utions to the Dynamic Facility Layout Problem\n\nQasem, Novoa, Kolla, Coyl
 e\n\nThe dynamic facility layout problem (DFLP) is concerned with finding 
 arrangements of facilities within plant locations that minimize the sum of
  material handling and relocation costs over a planning horizon. DFLP is r
 elevant in manufacturing engineering; accurate solutions can reduce operat
 ional cos...\n\n---------------------\nHPC-as-a-Service for Life Sciences\
 n\nSvaton, Martinovic, Jeliazkova, Chupakhin, Tomancak...\n\nHPC-as-a-Serv
 ice is a well-known term in the area of high performance computing. It ena
 bles users to access an HPC infrastructure without a need to buy and manag
 e their own infrastructure. Through this service, academia and industry ca
 n take advantage of the technology without an upfront investment ...\n\n--
 -------------------\nSciGaP: Apache Airavata Hosted Science Gateways\n\nPi
 erce, Marru, Abeysinghe, Pamidighantam, Christie...\n\nThe goal of the Sci
 ence Gateways Platform as a service (SciGaP.org) project is to provide cor
 e services for building and hosting science gateways. Over the last two ye
 ars, SciGaP services have been used to build and host over twenty-five sci
 ence gateways. SciGaP services support these gateways throu...\n\n--------
 -------------\nReproducibility as Side Effect\n\nWang, Zhen, Anderson, Kea
 hey\n\nThe ability to keep records and reproduce experiments is a critical
  element of the scientific method for any discipline. However, the recordi
 ng and publishing of research artifacts that allow others to reproduce and
  direct
 ly compare against existing research continue to be a challenge. In this p
 aper, we pr...\n\n---------------------\nUsing Darshan and CODES to Evalua
 te Application I/O Performance\n\nKhetawat, Zimmer, Mueller, Vazhkudai, At
 chley\n\nBurst buffers have become increasingly popular in HPC systems, al
 lowing bursty I/O traffic to be serviced faster without slowing down appli
 cation execution. The ubiquity of burst buffers creates opportunities for 
 studying their ideal placement in the HPC topology. Furthermore, the topol
 ogy of the ne...\n\n---------------------\nGPU-Accelerated Interpolation f
 or 3D Image Registration\n\nHimthani, Mang, Gholami, Biros\n\nImage regist
 ration is a key technology in image computing with numerous applications i
 n medical imaging. Our overarching goal is the design of a consistent and 
 unbiased computational framework for the integration of medical imaging da
 ta with simulation and optimization to support clinical decision m...\n\n-
 --------------------\nHIVE: A Cross-Platform, Modular Visualization Ecosys
 tem for Heterogeneous Computational Environments\n\nNonaka, Ono, Sakamoto,
  Hayashi, Kawanabe...\n\nHPC operational environments usually have support
 ing computational systems for assisting pre- and post-processing activitie
 s such as the visualization and analysis of simulation results. A wide var
 iety of hardware systems can be found at different HPC sites, and in our c
 ase, we have a  CPU-only (x86...\n\n---------------------\nImproving the I
 /O Performance and Memory Usage of the Xolotl Cluster Dynamics Simulator\n
 \nRoth, Blondel, Bernholdt, Wirth\n\nXolotl is a cluster dynamics simulato
 r used to predict gas bubble evolution in solids. It is currently being us
 ed to simulate bubble formation in the plasma-facing surface within fusion
  reactors and the nuclear fuel used in fission reactors. After observing p
 erformance problems in coupled-code simul...\n\n---------------------\nPer
 formance Evaluation of the Shifted Cholesky QR Algorithm for Ill-Condition
 ed Matrices\n\nFukaya, Kannan, Nakatsukasa, Yamamoto, Yanagisawa\n\nThe Ch
 olesky QR algorithm, which computes the QR factorization of a matrix, is a
  simple yet efficient algorithm for high-performance computing. However,
  it
  suffers from numerical instability. In a recent work, this instability ha
 s been remedied by repeating Cholesky QR twice (CholeskyQR2). CholeskyQ..
 .\n\n---------------------\nLarge Scale MPI-Parallelization of LBM and DEM
  Systems: Accelerating Research by Using HPC\n\nJelinek, Mason, Peters, Jo
 hnson, Brumfield...\n\nCasting, solidification, and the behavior of dry, s
 aturated, and partially saturated granular media are examples of interesti
 ng and important problems in multiple areas of civil, mechanical, and chem
 ical engineering. For interacting particle-fluid systems, the Discrete Ele
 ment Method (DEM) and Latti...\n\n---------------------\nHermes: a Multi-T
 iered Distributed I/O Buffering System for HDF5\n\nDevarajan\n\nHigh-Perfo
 rmance Computing (HPC) systems’ increasing ability to run data-intensive p
 roblems at larger scale and resolution has driven the evolution of modern 
 storage technologies. In addition, extreme amounts of data are collected b
 y large scientific instruments and sensor networks, resulting in a ...\n\
 n---------------------\nWorkflow for Parallel Processing of Sequential Mes
 h Databases\n\nMeca, Říha, Brzobohatý\n\nThis poster presents a workf
 low for parallel loading of sequentially stored mesh databases. It can be 
 used as a connection between tools for the creation of complex engineering
  models along with parallel solvers to allow broader usage of HPC by the e
 ngineering community. Scalability tests show that ...\n\n-----------------
 ----\nThe NAStJA Framework: Non-Collective Scalable Global Communications\
 n\nBerghoff, Kondov\n\nIn recent years, simulations in various areas of sc
 ience and engineering have proven to be very useful.  To efficiently deplo
 y simulation codes on current and future high-performance computer systems
 , high node level performance, scalable communication and the exclusion of
  unnecessary calculations a...\n\n---------------------\nHardware Accelera
 tion of CNNs with Coherent FPGAs\n\nSefat, Aslan, Qasem\n\nThis paper desc
 ribes a new flexible approach to implementing energy-efficient CNNs on FPG
 As. Our design leverages the Coherent Accelerator Processor Interface (CAP
 I) which provides a cache-coherent view of system memory to attached accel
 erators. Convolution layers are formulated as matrix multiplica...\n\n----
 -----------------\nDistributed Fast Boundary Element Methods\n\nMerta, Zap
 letal, Kravcenko\n\nWe present a parallel implementation of the fast bound
 ary element method (BEM) for the Helmholtz equation. After a brief descrip
 tion of BEM, vectorization of the computationally most demanding kernels, 
 and shared memory parallelization, we focus on the distributed memory para
 llelization using a new ...\n\n---------------------\nDevelopment of Numer
 ical Coupled Analysis Method by Air Flow Analysis and Snow Accretion Analy
 sis\n\nMurotani, Nakade, Kamata, Takahashi\n\nIn this research, to develop
  countermeasures against snow accretion damage, we developed a simulator
  that reproduces the snow accretion process in the following steps. First,
  airflow analysis is performed by the “Airflow simulator” developed by
  RTRI (Railway Technical Research Institute). Second, traject...\n\n
 -----------------
 ----\nPortable Parallel Performance via Multi-Dimensional Homomorphisms\n\
 nRasch, Schulze, Gorlatch\n\nAchieving portable performance over different
  parallel architectures and varying problem sizes is hard: e.g., a program
  optimized for multi-core CPUs on large input sizes can significantly diff
 er from the same program optimized for Graphics Processing Units (GPUs) on
  small sizes.\n\nWe propose an appr...\n\n---------------------\nWarpX: To
 ward Exascale Modeling of Plasma Particle Accelerators\n\nThevenet, Vay, A
 lmgren, Bell, Lehe...\n\nTurning the current experimental plasma accelerat
 or state-of-the-art from a promising technology into mainstream scientific
  tools depends critically on high-performance, high-fidelity modeling of c
 omplex processes that develop over a wide range of space and time scales. 
 As part of the U.S. Departmen...\n\n---------------------\nEnabling Data A
 nalytics Workflows Using Node-Local Storage\n\nDo, Jiang, Gallagher, Chu, 
 Harrison...\n\nThe convergence of high-performance computing (HPC) and Big
  Data is a necessity with the push toward extreme-scale computing. As HPC 
 simulations become more complex, the analytics need to process larger amou
 nts of data, which poses significant challenges for coupling HPC simulatio
 ns with Big Data an...\n\n---------------------\nOpeNNdd: Open Neural Netw
 orks for Drug Discovery: Creating Free and Easy Methods for Designing Medi
 cine\n\nKroencke, Shacterman, Pavini, Samudio, Crivelli\n\nBringing new me
 dicines to patients can be prohibitively expensive in terms of time, cost,
  and resources.  This leaves many diseases without therapeutic interventio
 ns.  In addition, new and reemerging diseases are increasing in prevalence
  across the globe at an alarming rate.  The speed and scale of ...\n\n----
 -----------------\nSC18 Research Posters\n\n\n\nSC18 Research Posters will
  be on display on Tuesday, Wednesday, and Thursday from 8:30am to 5pm in
  the C
 2/3/4 Ballroom.\n\n---------------------\nFeatherCNN: Fast Inference Compu
 tation with TensorGEMM on ARM Architectures\n\nLan, Meng, Hundt, Schmidt, 
 Deng...\n\nThis poster presents a fast inference computation library for A
 RM architectures, named CNNForward. CNNForward aims to improve the e
 fficiency of inference computation for convolutional neural networks on AR
 M-based multi-core and many-core architectures using both mathematical for
 mula reconstruc...\n\n---------------------\nBoosting the Scalability of C
 ar-Parrinello Molecular Dynamics Simulations for Multi- and Manycore Archi
 tectures\n\nKlöffel, Meyer, Mathias\n\nWe present our recent optimizations
  of the ultra-soft pseudo-potential (USPP) code path of the ab initio
  molec
 ular dynamics program CPMD (www.cpmd.org). Following the internal instrume
 ntation of CPMD, all relevant USPP routines have been revised to fully sup
 port hybrid MPI+OpenMP parallelization. For...\n\n---------------------\nC
 haracterizing Declustered Software RAID for Enhancing Storage Reliability 
 and Performance\n\nQiao, Fu, Chen, Settlemyer\n\nRedundant array of indepe
 ndent disks (RAID) has been widely used to address the reliability issue i
 n storage systems. As the scale of modern storage systems continues to gro
 w, disk failure becomes the norm. With ever-increasing disk capacity, RAID
  recovery based on disk rebuild becomes more costly, ...\n\n--------------
 -------\nParallel Implementation of Machine Learning-Based Many-Body Poten
 tials on CPU and GPU\n\nZhai, Danandeh, Tan, Gao, Paesani...\n\nMachine le
 arning models can be used to develop highly accurate and efficient many-bo
 dy potentials for molecular simulations based on the many-body expansion o
 f the total energy.  A prominent example is the MB-pol water model that em
 ploys permutationally invariant polynomials (PIPs) to represent the ...\n\
 n---------------------\nImplementing Efficient Data Compression and Encryp
 tion in a Persistent Key-Value Store for HPC\n\nKim, Vetter\n\nRecently, p
 ersistent data structures, like key-value stores (KVSs), which are stored 
 in an HPC system's nonvolatile memory, provide an attractive solution for 
 a number of emerging challenges like limited I/O performance. This paper i
 nvestigates how to efficiently integrate data compression and encry...\n\n
 ---------------------\nA Parallel-Efficient GPU Package for Multiphase Flo
 w in Realistic Nano-Pore Networks\n\nXia, Blumers, Li, Luo, Goral...\n\nSi
 mulations of fluid flow in oil/gas shale rocks are challenging in part due
  to the heterogeneous pore sizes ranging from a few nanometers to a few mi
 crometers. Additionally, the complex fluid-solid interaction occurring phy
 sically and chemically must be captured with high resolution. To address t
 he...\n\n---------------------\nProcessing-in-Storage Architecture for Mac
 hine Learning and Bioinformatics\n\nKaplan, Yavits, Ginosar\n\nUser-genera
 ted and bioinformatics database volumes have been increasing exponentially
  for more than a decade. With the slowdown and approaching end of Moore's l
 aw, traditional technologies cannot satisfy the increasing demands for pro
 cessing power.   This work presents PRINS, a highly-parallel in-sto...\n\n
 ---------------------\nKernel-Based and Total Performance Analysis of CGYR
 O on 4 Leadership Systems\n\nSfiligoi, Candy, Belli\n\nWe present the resu
 lts of an exhaustive performance analysis of the CGYRO code on 4 leadershi
 p systems spanning 5 different configurations (2 KNL-based, 1 Skylake-base
 d, and 2 hybrid CPU-GPU architectures). CGYRO is an Eulerian gyrokinetic s
 olver designed and optimized for collisional, electromagnet...\n\n--------
 -------------\nRedesigning The Absorbing Boundary Algorithm for Asynchrono
 us High Performance Acoustic Wave Propagation\n\nAbdelkhalak, Akbudak, Eti
 enne, Tonellot\n\nExploiting high concurrency, relaxing the synchrony of e
 xisting algorithms, and increasing data reuse have an immense effect on
  perfo
 rmance. We integrate the Multicore-optimized Wavefront Diamond (MWD) tilin
 g approach by Malas et al. [SIAM SISC, 2015, ACM Trans. Parallel Comput. 2
 017],  which takes int...\n\n---------------------\nCapsule Networks for P
 rotein Structure Classification\n\nRosa de Jesus, Cuevas Paniagua, Rivera,
  Crivelli\n\nCapsule Networks have great potential to tackle problems in s
 tructural biology because of their attention to hierarchical relationships
 . This work describes the implementation and application of a capsule netw
 ork architecture to the classification of RAS protein family structures on
  GPU-based comput...\n\n---------------------\nCross-Layer Group Regulariz
 ation for Deep Neural Network Pruning\n\nGao, Liu\n\nImproving weight spa
 rsity is a common strategy for deep neural network pruning. Most existing 
 methods use regularizations that only consider structural sparsity within 
 an individual layer. In this paper, we propose a cross-layer group regular
 ization taking into account the statistics from multiple ...\n\n----------
 -----------\nMachine Learning for Adaptive Discretization in Massive Multi
 scale Biomedical Modeling\n\nHan, Gupta, Zhang, Bluestein, Deng\n\nFor mul
 tiscale problems, traditional time-stepping algorithms use a single, small
 est time step size to capture the finest details; using this scale leads
  to a significant waste of computing resources when simulating the coarse-
 grained portion of the problem. To improve computing efficiency for mul...
 \n\n---------------------\nMulti-GPU Accelerated Non-Hydrostatic Numerical
  Ocean Model with GPUDirect RDMA Transfers\n\nYamagishi, Matsumura, Hasumi
 \n\nWe have implemented our “kinaco” numerical ocean model on Tokyo Univer
 sity’s Reedbush supercomputer, which utilizes the latest Nvidia Pascal P10
 0 GPUs with GPUDirect technology. We have also optimized the model’s Poiss
 on/Helmholtz solver by adjusting the global memory alignment and thread bl
 ock conf...\n\n---------------------\nA Locality and Memory Congestion-Awa
 re Thread Mapping Method for Modern NUMA Systems\n\nAgung, Amrizal, Egawa,
  Takizawa\n\nOn modern NUMA systems, the memory congestion problem could d
 egrade performance more than the memory access locality problem because a 
 large number of processor cores in the systems can cause heavy congestion 
 on memory controllers. In this work, we propose a thread mapping method th
 at considers the ...\n\n---------------------\nTuning CFD Applications for
  Intel Xeon Phi with TAU Commander and ParaTools ThreadSpotter\n\nBeekman,
  Chaimov, Shende, Malony, Bisek...\n\nTuning and understanding the perform
 ance characteristics of computational fluid dynamics (CFD) codes on many-c
 ore, NUMA architectures is challenging. One must determine how programming
  choices impact algorithm performance and how best to utilize the availabl
 e memory caches, high-bandwidth memory, an...\n\n---------------------\nMa
 ssively Parallel Stress Chain Characterization for Billion Particle DEM Si
 mulation of Accretionary Prism Formation\n\nFuruichi, Nishiura, Hori\n\nHe
 rein, a novel algorithm for characterizing stress chains using a large par
 allel computer system is presented. Stress chains are important for analyz
 ing the results of large-scale discrete element method (DEM) simulations. 
 However, the general algorithm is difficult to parallelize, especially
  when
  s...\n\n---------------------\nRefactoring and Optimizing Multiphysics Co
 mbustion Models for Data Parallelism\n\nStone, Poludnenko, Taylor\n\nHigh-
 fidelity combustion simulations combine high-resolution computational flui
 d dynamics numerical methods with multi-physics models to capture chemical
  kinetics and transport processes. These multi-physics models can dominate
  the computation cost of the simulation. Due to the high cost of combusti.
 ..\n\n---------------------\nInteractive HPC Deep Learning with Jupyter No
 tebooks\n\nBhimji, Farrell, Evans, Henderson, Cholia...\n\nDeep learning r
 esearchers are increasingly using Jupyter notebooks to implement interacti
 ve, reproducible workflows. Such solutions are typically deployed on small
 -scale (e.g. single server) computing systems. However, as the sizes and c
 omplexities of datasets and associated neural network models in...\n\n----
 -----------------\nFast and Accurate Training of an AI Radiologist\n\nWils
 on, Gundecha, Varadharajan, Filby, Yang...\n\nThe health care industry is 
 expected to be an early adopter of AI and deep learning to improve patient
  outcomes, reduce costs, and speed up diagnosis. We have developed models 
 for using AI to diagnose pneumonia, emphysema, and other thoracic patholog
 ies from chest x-rays. Using the Stanford Universi...\n\n-----------------
 ----\nFull State Quantum Circuit Simulation by Using Lossy Data Compressio
 n\n\nWu, Di, Cappello, Finkel, Alexeev...\n\nIn order to evaluate, validat
 e, and refine the design of a new quantum algorithm or a quantum computer,
  researchers and developers need methods to assess their correctness and f
 idelity. This requires the capabilities of simulation for full quantum sta
 te amplitudes. However, the number of quantum sta...\n\n------------------
 ---\nAn Efficient SIMD Implementation of Pseudo-Verlet Lists for Neighbor 
 Interactions in Particle-Based Codes\n\nWillis, Schaller, Gonnet\n\nIn par
 ticle-based simulations, neighbour finding (i.e. finding pairs of particle
 s to interact within a given range) is the most time consuming part of the
  computation. One of the best such algorithms, which can be used for both 
 Molecular Dynamics (MD) and Smoothed Particle Hydrodynamics (SPH) simula..
 .\n\n---------------------\nUnderstanding Potential Performance Issues Usi
 ng Resource-Based alongside Time Models\n\nDing, Lee, Xue, Zheng\n\nNumero
 us challenges and opportunities are introduced by the complexity and enorm
 ous code legacy of HPC applications, the diversity of HPC architectures, a
 nd the nonlinearity of interactions between applications and HPC systems. 
 To address these issues, we propose the Resource-based Alongside Time (R..
 .\n\n---------------------\nMPI/OpenMP parallelization of the Fragment Mol
 ecular Orbitals Method in GAMESS\n\nMironov, Alexeev, Fedorov\n\nIn this w
 ork, we present a novel parallelization strategy for the Fragment Molecula
 r Orbital (FMO) method in the quantum chemistry package GAMESS. The origin
 al FMO code has been parallelized only with MPI, which limits scalability 
 of the code on multi-core massively parallel machines. To address thi...\n
 \n---------------------\nAutomatic Generation of Mixed-Precision Programs\
 n\nMoody, Pinnow, Lam, Menon, Schordan...\n\nFloating-point arithmetic is 
 foundational to scientific computing in HPC, and choices about floating-po
 int precision can have a significant effect on the accuracy and speed of H
 PC codes. Unfortunately, current precision optimization tools require sign
 ificant user interaction, and few work on the sca...\n\n------------------
 ---\nUPC++ and GASNet-EX: PGAS Support for Exascale Applications and Runti
 mes\n\nBaden, Hargrove, Ahmed, Bachan, Bonachea...\n\nLawrence Berkeley Na
 tional Lab is developing a programming system to support HPC application d
 evelopment using the Partitioned Global Address Space (PGAS) model. This w
 ork is driven by the emerging need for adaptive, lightweight communication
  in irregular applications at exascale.  We present an ove...\n\n---------
 ------------\nEnabling Reproducible Microbiome Science through Decentraliz
 ed Provenance Tracking in QIIME 2\n\nNaimey, Keefe\n\nIn this poster, we d
 emonstrate the ways in which automatic, integrated, decentralized provenan
 ce tracking in QIIME 2, a leading microbiome bioinformatics platform, enab
 les reproducible microbiome science. We use sample data from a recent stud
 y of arid soil microbiomes  (Significant Impacts of Increa...\n\n---------
 ------------\nOptimizing Next Generation Hydrodynamics Code for Exascale S
 ystems\n\nAkhmetova, Lakshmiranganatha, Mukherjee, Oullet, Payne...\n\nStu
 dying continuum dynamics problems computationally can illuminate complex p
 hysical phenomena where experimentation is too costly. However, the models
  used in studying these phenomena usually require intensive calculations, 
 some of which are beyond even the largest supercomputers to date. Emerging
  ...\n\n---------------------\nMGRIT Preconditioned Krylov Subspace Method
 \n\nYoda, Fujii, Tanaka\n\nMGRIT re-discretizes the problem with a larger
  time-step width at the coarse levels, which often causes unstable
  convergence.
  We propose a Krylov subspace method with MGRIT preconditioning as a more 
 stable solver. For unstable problems, MGRIT preconditioned Krylov subspace
  method performed better than M...\n\n---------------------\nEnabling Neut
 rino and Antineutrino Appearance Observation Measurements with HPC Facilit
 ies\n\nBuchanan, Calvez, Ding, Doyle, Himmel...\n\nWhen fitting to data wi
 th low statistics and near physical boundaries, extra measures need to be 
 taken to ensure proper statistical coverage. The method NOvA uses is calle
 d the Feldman-Cousins procedure, which entails fitting thousands of indepe
 ndent pseudoexperiments to generate acceptance interval...\n\n------------
 ---------\nLarge Scale Computation of Quantiles Using MELISSA\n\nRibes, Te
 rraz, Fournier, Iooss, Raffin\n\nQuantiles being order statistics, the cla
 ssical approach for their computation requires availability of the full sa
 mple before ranking it. This approach is not suitable at exascale. Large e
 nsembles would need to gather a prohibitively large amount of data. We pro
 pose an iterative approach based on t...\n\n---------------------\nFlowOS-
 RM: Disaggregated Resource Management System\n\nTakano, Suzaki, Koie\n\nA 
 traditional data center consisting of monolithic servers is confronted
  with limitations including lack of operational flexibility, low resource
  utiliz
 ation, low maintainability, etc. Resource disaggregation is a promising so
 lution to address the above issues. We propose a concept of disaggregated 
 da...\n\n---------------------\nProgramming the EMU Architecture: Algorith
 m Design Considerations for Migratory-Threads-Based Systems\n\nBelviranli,
  Lee, Vetter\n\nThe decades-old memory bottleneck problem for data-intensi
 ve applications is getting worse as the processor core counts continue to 
 increase. Workloads with sparse memory access characteristics only achieve
  a fraction of a system's total memory bandwidth. The EMU architecture
  provide
 s a radical approach...\n\n---------------------\nOpenACC to FPGA: A Direc
 tive-Based High-Level Programming Framework for High-Performance Reconfigu
 rable Computing\n\nLee, Lambert, Kim, Vetter, Malony\n\nAccelerator-based 
 heterogeneous computing has become a popular solution for power-efficient
  h
 igh performance computing (HPC).  Along these lines, Field Programmable Ga
 te Arrays (FPGAs) have offered more advantages in terms of performance and
  energy efficiency for specific workloads than other acceler...\n\n-------
 --------------\nTensor-Optimized Hardware Accelerates Fused Discontinuous 
 Galerkin Simulations\n\nBreuer, Heinecke, Cui\n\nIn recent years the compu
 te/memory balance of processors has been continuously shifting towards com
 pute. The rise of Deep Learning, based on matrix multiplications, accelera
 ted this path, especially in terms of single precision and lower precision
  compute. An important research question is if this d...\n\n--------------
 -------\nAI Matrix – Synthetic Benchmarks for DNN\n\nWei, Xu, Jin, Zhang, 
 Zhang\n\nThe current AI benchmarks suffer from a number of drawbacks. Firs
 t, they cannot adapt to the emerging changes of deep learning (DL) algorit
 hms and are fixed once selected. Second, they contain tens to hundreds of 
 applications and have very long running time. Third, they are mainly selec
 ted from open...\n\n---------------------\nApplying the Execution-Cache-Me
 mory Model: Current State of Practice\n\nHager, Eitzinger, Hornich, Cremon
 esi, Alappat...\n\nThe ECM (Execution-Cache-Memory) model is an analytic, 
 resource-based  performance model for steady-state loop code running on mu
 lticore processors. Starting from a machine model, which describes the int
 eraction between the code and the hardware, and static code analysis, it a
 llows an accurate predi...\n\n---------------------\nPerformance Evaluatio
 n of the NVIDIA Tesla V100: Block Level Pipelining vs. Kernel Level Pipeli
 ning\n\nCui, Scogland, de Supinski, Feng\n\nAs accelerators become more co
 mmon, expressive and performant, interfaces for them become ever more impo
 rtant. Programming models like OpenMP offer simple-to-use but powerful dir
 ective-based offload mechanisms. By default, these models naively copy dat
 a to or from the device without overlapping comp...\n\n-------------------
 --\nJob Simulation for Large-Scale PBS-Based Clusters with the Maui Schedu
 ler\n\nZitzlsberer, Jansik, Martinovic\n\nFor large-scale High Performance
  Computing centers with a wide range of different projects and heterogeneo
 us infrastructures, efficiency is an important consideration. Understandin
 g how compute jobs are scheduled is necessary for improving the job schedu
 ling strategies in order to optimize cluster u...\n\n---------------------
 \nScript of Scripts Polyglot Notebook and Workflow System\n\nWang, Leong, 
 Peng\n\nComputationally intensive disciplines such as computational biolog
 y often use tools implemented in different languages and analyze data on h
 igh-performance computing systems. Although scientific workflow systems ca
 n powerfully execute large-scale data-processing, they are not suitable fo
 r ad hoc dat...\n\n---------------------\nEnabling High-Level Graph Proces
 sing via Dynamic Tasking\n\nDrocco, Castellana, Minutoli, Tumeo, Feo\n\nDa
 ta-intensive computing yields irregular and unbalanced workloads, in parti
 cular on large-scale problems running on distributed systems. Task-based r
 untime systems are commonly exploited to implement higher-level data-centr
 ic programming models, promoting multithreading and asynchronous coordinat
 io...\n\n---------------------\nTensorfolding: Improving Convolutional Neu
 ral Network Performance with Fused Microkernels\n\nAnderson, Georganas, Av
 ancha, Heinecke\n\nConvolution layers are prevalent in many classes of dee
 p neural networks, including Convolutional Neural Networks (CNNs) which pr
 ovide state-of-the-art results for tasks like image recognition, neural ma
 chine translation and speech recognition. In the recent past, several tech
 niques to improve gener...\n\n---------------------\nBinarized ImageNet In
 ference in 29us\n\nGeng, Li, Wang, Song, Herbordt\n\nWe propose a single-F
 PGA-based accelerator for ultra-low-latency inference of ImageNet in this 
 work. The design can complete the inference of Binarized AlexNet within 29
 us with accuracy comparable to other BNN implementations.  We achieve this
  performance with the following contributions: 1. We comp...\n\n----------
 -----------\nToward Smoothing Data Movement Between RAM and Storage\n\nAlt
 urkestani, Tonellot, Etienne, Ltaief\n\nWe propose to design and implement
  a software framework, which provides a Multilayer Buffer System (MBS) to 
 cache in/out datasets into CPU main memory from/to slower storage media, s
 uch as parallel file systems (e.g., Lustre), solid-state drive (e.g., Burs
 t Buffer) or non-volatile RAM. Although MBS ...\n\n---------------------\n
 MATEDOR: MAtrix, TEnsor, and Deep-Learning Optimized Routines\n\nAbdelfatt
 ah, Dongarra, Tomov, Yamazaki, Haidar\n\nThe MAtrix, TEnsor, and Deep-lear
 ning Optimized Routines (MATEDOR) project develops software technologies a
 nd standard APIs, along with a sustainable and portable library, for large
 -scale computations that can be broken down into very small matrix or tens
 or computations. The main target of MATEDOR i...\n\n---------------------\
 nAccelerating Wave-Propagation Algorithms with Adaptive Mesh Refinement Us
 ing the Graphics Processing Unit (GPU)\n\nQin, LeVeque, Motley\n\nClawpack
  is a library for solving nonlinear hyperbolic partial differential equati
 ons using high-resolution finite volume methods based on Riemann solvers a
 nd limiters. It supports Adaptive Mesh Refinement (AMR), which is essentia
 l in solving multi-scale problems. Recently, we added capabilities to ...\
 n\n---------------------\nDistributed Adaptive Radix Tree for Efficient Me
 tadata Search on HPC Systems\n\nZhang, Tang, Byna, Chen\n\nAffix-based sea
 rch allows users to retrieve data without the need to remember all relevan
 t information precisely. While building an inverted index to facilitate ef
 ficient affix-based search is a common practice for standalone databases a
 nd desktop file systems, such approaches are often insufficient for high-p
 ...\n\n----
 -----------------\nImproving Error-Bounded Lossy Compression for Cosmologi
 cal N-Body Simulation\n\nLi, Di, Liang, Chen, Cappello\n\nCosmological sim
 ulations may produce extremely large amounts of data, such that a success
 ful run depends on large storage capacity and huge I/O bandwidth, especial
 ly at exascale. Effective error-bounded lossy compress
 ors with both high compression ratios and low data distortion ...\n\n-----
 ----------------\nVeloC: Very Low Overhead Checkpointing System\n\nNicolae
 , Cappello, Moody, Gonsiorowski, Mohror\n\nCheckpointing large amounts of 
 related data concurrently to stable storage is a common I/O pattern of man
 y HPC applications. However, such a pattern frequently leads to I/O bottle
 necks that lead to poor scalability and performance. As modern HPC infrast
 ructures continue to evolve, there is a growing...\n\n--------------------
 -\nEstimating Molecular Dynamics Chemical Shift with GPUs\n\nWright, Ferra
 to\n\nExperimental chemical shifts (CS) from solution and solid state magi
 c-angle-spinning nuclear magnetic resonance spectra provide atomic level d
 ata for each amino acid within a protein or complex. However, structure de
 termination of large complexes and assemblies based on NMR data alone rema
 ins challe...\n\n---------------------\nUsing Thrill to Process Scientific
  Data on HPC\n\nKarabin, Chen, Suresh, Jimenez, Lo...\n\nWith ongoing impr
 ovement of computational power and memory capacity, the volume of scientif
 ic data keeps growing. To gain insights from vast amounts of data, scienti
 sts are starting to look at Big Data processing and analytics tools such a
 s Apache Spark. In this poster, we explore Thrill, a framewor...\n\n------
 ---------------\nGPU Acceleration at Scale with OpenPower Platforms in Cod
 e_Saturne\n\nAntao, Moulinec, Fournier, Sawko, Zimon...\n\nCode_Saturne is
  a widely used computational fluid dynamics software package that uses fin
 ite-volume methods to simulate different kinds of flows tailored to tackle
  multi-billion-cell unstructured mesh simulations. This class of codes has
  been shown to be challenging to accelerate on GPUs as they consist o...
 \n\n----
 -----------------\nLarge-Message Size Allreduce at Wire Speed for Distribu
 ted Deep Learning\n\nTanaka, Arikawa, Kawai, Kato, Ito...\n\nIn large-scal
 e distributed deep learning, the Allreduce operation for large messages (1
 00 KB or more) is critical for gathering gradients from multiple worker no
 des and broadcasting the sum of the gradients to them. When the message is
  large, the latency in Allreduce operation would make it difficul...\n\n--
 -------------------\nSol: Transparent Neural Network Acceleration Platform
 \n\nWeber\n\nWith the usage of neural networks in a wide range of applicat
 ion fields, the necessity to execute these efficiently on high performance
  hardware is one of the key problems for artificial intelligence (AI) fram
 ework providers. More and more new specialized hardware types and correspo
 nding libraries a...\n\n---------------------\nDetection of Silent Data Co
 rruptions in Smooth Particle Hydrodynamics Simulations\n\nCavelan, Ciorba,
  Cabezón\n\nSoft errors, such as silent data corruptions (SDCs), hinder
  the correctness of large-scale scientific applications. Ghost replication
  (GR) is proposed herein as the first SDC detector relying on the fast
  error propagation inherent to applications that employ the smooth
  particle hydro
 dynamics (SPH) m...\n\n---------------------\nDeepSim-HiPAC: Deep Learning
  High Performance Approximate Calculation for Interactive Design and Proto
 typing\n\nAl-Jarro, Georgescu, Tomita, Nakashima\n\nWe present a data-driv
 en technique that can learn from physics-based simulations for the instan
 t prediction of field distribution for 3D objects. Such techniques are ext
 remely useful when considering, for example, computer aided engineering (C
 AE), where computationally expensive simulations are oft...\n\n-----------
 ----------\nTop-Down Performance Analysis of Workflow Applications\n\nHero
 ld, Williams\n\nScientific simulation frameworks are commonly used on HPC
  systems. They contain parallelized algorithms and provide various solvers
  for a specific application domain. Usually, engineers solve a particular
  problem in multiple steps, which are often distributed over multiple
  jobs. Finding perfo...\n\n---------------------\nConvolutional Neural Net
 works for Coronary Plaque Classification in Intravascular Optical Coherenc
 e Tomography (IVOCT) Images\n\nKolluru, Prabhu, Gharaibeh, Wilson, Gajurel
 \n\nCurrently, IVOCT is the only imaging technique with the resolution nec
 essary to identify vulnerable thin cap fibro-atheromas (TCFAs). IVOCT also
  has greater penetration depth in calcified plaques as compared to Intrava
 scular Ultrasound (IVUS). Despite its advantages, IVOCT image interpretati
 on is ch...\n\n---------------------\nCompiling SIMT Programs on Multi- an
 d Many-Core Processors with Wide Vector Units: A Case Study with CUDA\n\nW
 u, Ravi, Becchi\n\nThere has been an increasing interest in SIMT programmi
 ng tools for multi- and manycore (co)processors with wide vector extension
 s. In this work, we study the effective implementation of a SIMT programmi
 ng model (a subset of CUDA C) on Intel platforms with 512-bit vector exten
 sions (hybrid MIMD/SIMD...\n\n---------------------\nAn Alternative Approa
 ch to Teaching Bigdata and Cloud Computing Topics at CS Undergraduate Leve
 l\n\nDeb, Fuad, Irwin\n\nBig data and cloud computing collectively offer a
  paradigm shift in the way businesses are now acquiring, using and managin
 g information technology. This creates the need for every CS student to be
  equipped with foundational knowledge in this collective paradigm and to
  possess some hands-on experience...\n\n---------------------\nA Massively
  Par
 allel Evolutionary Markov Chain Monte Carlo Algorithm for Sampling Complic
 ated Multimodal State Spaces\n\nCho, Liu\n\nWe develop an Evolutionar
 y Markov Chain Monte Carlo (EMCMC) algorithm for sampling from large multi
 -modal state spaces. Our algorithm combines the advantages of evolutionary
  algorithms (EAs) as optimization heuristics and the theoretical convergen
 ce properties of Markov Chain Monte Carlo (MCMC) algo...\n\n--------------
 -------\nMLModelScope: Evaluate and Measure Machine Learning Models within
  AI Pipelines\n\nDakkak, Li, Hwu, Xiong\n\nThe current landscape of Machin
 e Learning (ML) and Deep Learning (DL) is rife with non-uniform frameworks
 , models, and system stacks but lacks standard tools to facilitate the eva
 luation and measurement of models. Due to the absence of such tools, the c
 urrent practice for evaluating and comparing th...\n\n--------------------
 -\nA Compiler Framework for Fixed-Topology Non-Deterministic Finite Automa
 ta on SIMD Platforms\n\nNourian, Wu, Becchi\n\nAutomata traversal accelera
 tion has been studied on various parallel platforms. Many existing acceler
 ation methods store finite automata states and transitions in memory. For 
 these designs memory size and bandwidth are the main limiting factors to p
 erformance and power efficiency. Many applications,...\n\n----------------
 -----\nA Low-Communication Method to Solve Poisson's Equation on
  Locally-St
 ructured Grids\n\nVan Straalen, McCorquodale, Colella, Kavouklis\n\nThis p
 oster describes a new algorithm, Method of Local Corrections (MLC), and a 
 high-performance implementation for solving Poisson's equation with infini
 te-domain boundary conditions, on locally-refined nested rectangular grids
 .  The data motion is comparable to that of only a single V-cycle of mul..
 .\n\n---------------------\nFloating-Point Autotuner for CPU-Based Mixed-P
 recision Applications\n\nGu, Beata, Becchi\n\nIn this poster, we present t
 he design and development of an autotuning tool for floating-point code. T
 he goal is to balance accuracy and performance in order to produce an effi
 cient and accurate mixed-precision program. The tuner starts by maximizing
  accuracy through the use of a high-precision libr...\n
END:VEVENT
END:VCALENDAR