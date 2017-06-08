Google Cloud: Fastest track to Apache Hadoop and Spark success

A combination of rapid startup time, per-minute billing and cloud-native architecture is transformative for operators.

It’s 2017, and running Apache Hadoop (or Apache Spark) is still too hard, whether on-premises or in the cloud. Ironically, the main source of this difficulty is one of Hadoop’s strengths: the ability to run multitenant workloads. When concurrency is limited to one user (which in most enterprises happens rarely, if ever), that first TeraSort on your shiny new idle cluster runs relatively fast. Under real-world load, with dozens of analysts running brute-force, join-heavy jobs that create massive resource contention, your cluster could grind to a halt.

Using YARN for resource management is helpful for mitigating this contention, but at the cost of increased complexity: YARN makes it difficult to understand the resource utilization of each job, diagnose performance bottlenecks or do job accounting and resource chargebacks. (Furthermore, even if YARN helps you improve utilization, you’ll still pay for resources that you don’t use.) Thus the all-too-common solution for any and all issues is to ask for more nodes.
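To make the point about paying for unused resources concrete, here is a minimal back-of-the-envelope sketch in Python. It compares a cluster that stays up all day with clusters that exist only while a job runs and are billed per minute. The function names, the node count, the hourly price and the job durations are all hypothetical values chosen for illustration; they are not taken from the article or from any real price list.

```python
import math


def always_on_cost(nodes, price_per_node_hour, hours):
    """Cost of a cluster that stays up for `hours`, busy or idle."""
    return nodes * price_per_node_hour * hours


def job_scoped_cost(job_minutes, nodes, price_per_node_hour, min_billed_minutes=1):
    """Cost of running each job on its own short-lived cluster.

    Assumes per-minute billing, rounded up to whole minutes, with a
    hypothetical minimum charge of `min_billed_minutes` per cluster.
    """
    price_per_node_minute = price_per_node_hour / 60
    total = 0.0
    for minutes in job_minutes:
        billed = max(math.ceil(minutes), min_billed_minutes)
        total += billed * nodes * price_per_node_minute
    return total


if __name__ == "__main__":
    # Hypothetical workload: three jobs per day on 10 nodes at 0.60 per node-hour.
    jobs = [30, 45, 15]  # job durations in minutes
    print("always-on :", round(always_on_cost(10, 0.60, 24), 2))
    print("job-scoped:", round(job_scoped_cost(jobs, 10, 0.60), 2))
```

Under these made-up numbers the always-on cluster is billed for all 24 hours while the job-scoped clusters are billed only for the 90 minutes the jobs actually run. The size of the gap depends entirely on how busy the shared cluster really is.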

Read the entire article here: Fastest track to Apache Hadoop and Spark success: using job-scoped clusters on cloud-native architecture

via the fine folks at Google Cloud

