On-demand Data Analytics in HPC Environments at Leadership Computing Facilities: Challenges and Experiences

John Harney, Seung-Hwan Lim, Sreenivas Sukumar, Dale Stansberry, Peter Xenopoulos (National Center for Computational Sciences, Oak Ridge National Laboratory)

Data analysis infrastructures that can handle continuously accumulating data are quickly becoming essential for many organizations, including the U.S. Department of Energy (DOE). Although DOE supports some of the largest computing facilities in the world, newer analysis frameworks such as Apache Spark are difficult to deploy at these facilities. In this paper, we propose an on-demand Spark service that mitigates these difficulties, allowing facility users to create Spark instances quickly and flexibly. We define a systematic approach for creating these Spark instances and validate that the performance benefits of Spark are maintained. Using a series of benchmarks for algorithms commonly used in scientific workflows, we compare the behavior of Spark tasks running on facility resources with that of tasks running on an open research cloud that has a dedicated Spark infrastructure deployed. Finally, we use a scientific use case from the Center for Nanophase Materials Sciences at Oak Ridge National Laboratory to demonstrate the utility of Spark in the computing facility.