Leveraging Large Sensor Streams for Robust Cloud Control

Alok Singh (San Diego Supercomputer Center), Eric Stephan, Todd Elsethagen, Matt MacDuff, Bibi Raju, Malachi Schram (Pacific Northwest National Laboratory), Kerstin Kleese van Dam (Brookhaven National Laboratory), Darren J Kerbyson (Pacific Northwest National Laboratory), Ilkay Altintas (San Diego Supercomputer Center)

Today’s dynamic computing deployments for commercial and scientific applications are propelling us into an era where minor inefficiencies can snowball into significant performance and operational bottlenecks. Data center operations increasingly rely on sensor-based control systems for key decision insights. Higher sampling frequencies, cheaper storage, and the prolific deployment of sensors are producing massive volumes of operational data. However, there is a lag between the rapid development of analytical techniques and their widespread practical deployment. We present empirical evidence of the potential of analytical techniques for operations management in computing and data centers. Using machine learning modeling techniques on data from a real instrumented cluster, we demonstrate that predictive modeling on operational sensor data can directly reduce the cost of monitoring system operations and improve system reliability.