Identifying Performance Bottlenecks in Hive: Use of Processor Counters

Alexander C. Shulyak, Lizy K. John (The University of Texas at Austin)

Distributed SQL Query Engines, like Hive, Spark, and Impala, have become the de-facto database set-up for De- cision Support Systems with large database sizes. Unlike other distributed computing like graph processing and OLTP transactions, DSS queries are often CPU bound as opposed to Network I/O bound. In this paper, we identify apparent anomalies in query performance on a distributed Hive database setup, comparing it to a tra- ditional database (MySQL) system. We found MySQL is algorithmically efficient, and micro-architectural per- formance is affected only by the computational demands of the query. On the other-hand, Hive’s generic execu- tion framework has difficulty converting some queries into a set of MapReduce jobs, increasing algorithmic inefficency. We describe our experience of analyzing the root causes of the performance discrepancies between the MySQL and Hive executions. JVM startup overhead and context switches are seen to be a major bottleneck in Hive executions. Hive is also seen to use a larger code footprint which stresses the memory hierarchy. In ad- dition to using the Tez execution engine, which reuses containers across MapReduce tasks, the aforementioned costs could be reduced by running small queries together in a single session, and by forcing one process per available physical core.