Fast Virtualized Hadoop And Spark On All-Flash Disks – VMware White Paper
Best Practices for Optimizing Virtualized Big Data Applications on VMware vSphere 6.5
Best practices are described for optimizing Big Data applications running on VMware vSphere®. Hardware, software, and vSphere configuration parameters are documented, as well as tuning parameters for the operating system, Hadoop, and Spark. The Hewlett Packard Enterprise ProLiant DL380 Gen9 servers used in the test featured fast Intel processors with a large number of cores, large memory (512 GiB), and all-flash disks. Test results are shown from two MapReduce and three Spark applications running on three different configurations of vSphere (with 1, 2, and 4 VMs per host) as well as directly on the hardware. Among the virtualized clusters, the fastest configuration was 4 VMs per host due to NUMA locality and best disk utilization. The 4 VMs per host platform was faster than bare metal for all tests except a large (10 TB) TeraSort test, where the bare-metal advantage of larger memory outweighed the disadvantage of NUMA misses.
This paper will show how to best deploy and configure the underlying vSphere infrastructure, as well as the Hadoop cluster, in such an environment. Best practices for all layers of the stack will be documented and their implementation in the test cluster described. The performance of the cluster will be shown with the TeraSort suite, the TestDFSIO HDFS stress tool, and new Spark machine learning benchmarks.
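For readers unfamiliar with the MapReduce benchmarks named above, the commands below sketch how the TeraSort suite and TestDFSIO are typically invoked on a Hadoop cluster. This is an illustrative sketch, not the paper's exact test harness: jar paths, HDFS directories, file counts, and sizes are assumptions that vary by Hadoop distribution and cluster size.

```shell
# TeraGen: generate input for TeraSort. Rows are 100 bytes each,
# so 10^11 rows produce the 10 TB dataset discussed above.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    teragen 100000000000 /benchmarks/terasort-input

# TeraSort: sort the generated data (the main timed phase).
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    terasort /benchmarks/terasort-input /benchmarks/terasort-output

# TeraValidate: verify the output is globally sorted.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    teravalidate /benchmarks/terasort-output /benchmarks/terasort-validate

# TestDFSIO: stress HDFS with concurrent writers, then readers.
# (-nrFiles maps one file to one map task; counts/sizes are illustrative.)
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
    TestDFSIO -write -nrFiles 64 -size 1GB
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
    TestDFSIO -read -nrFiles 64 -size 1GB
```

These jobs require a running Hadoop cluster with HDFS and YARN; aggregate throughput and elapsed times are reported in the job output and, for TestDFSIO, in a local results log.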