Pepperdata Casts a Bright Light on Amazon Elastic MapReduce; Reveals Hidden Cost, Performance Metrics

Users of Amazon Elastic MapReduce to run big data and analytics projects are about to be offered a new world of visibility and control. Pepperdata, working with Amazon, is casting light into the dark corners of EMR operations, letting users see, manage and control the performance and costs for their jobs, workloads and apps.

Tags: analytics, Amazon, big data, cluster, dashboard, DevOps, EMR, management, MapReduce, Pepperdata, performance, visibility

Sean Suchter
Pepperdata

"Today, customers are flying blind with their [Amazon] Elastic MapReduce workloads. Pepperdata for EMR lets users uncover important information."

Enterprises using Amazon EMR (Elastic MapReduce) to run their big data and analytics projects are about to be offered a new world of visibility and control.


Pepperdata, working with Amazon, has developed a solution to shed light on the darker corners of EMR operations, cracking open its ‘black box’ reputation, Pepperdata CTO Sean Suchter told IDN. The result: Users will at last be able to see, manage and control the performance and costs of their jobs, workloads and apps running in EMR’s cloud.


With Pepperdata for EMR, the company is looking to deliver long-awaited visibility and controls to Amazon EMR users. Pepperdata for EMR provides:

  • Granular performance metrics for current and historical runs
  • Faster troubleshooting for DevOps teams for current and future jobs – while retaining useful data even after EMR clusters are terminated
  • Significant reduction in EMR costs through more efficient use of resources (auto-scaling coming in Q1 2017)
  • Improved job run times – up to 10 times faster

“Today, customers are flying blind with their EMR workloads,” Suchter said. “With our approach to metrics, troubleshooting becomes fast and easy and highly effective because we look directly at the apps rather than the smoke signals the nodes send up.  . . . Without Pepperdata for EMR, there are just no tools or methods that let users uncover important information,” he added.


Pepperdata for EMR sets out to answer some basic questions about operations and costs for running EMR jobs, including:  

  • How are my EMR clusters running?
  • How can I find out whether (and where) my performance may be slow?
  • How can I tune or reconfigure my jobs to run faster?
  • And even: Did my job break in the middle of the run – before it was completed?

We asked Suchter why Amazon EMR hasn’t had discovery or management technologies to answer such questions, while Amazon EC2 users have long had options. “The main reasons for the difference are technical,” Suchter said. “Your typical methodology of looking at longitudinal studies on VMs doesn’t work when your VMs keep turning over every day.”


Pepperdata for EMR provides metrics to answer these questions as well as the power to control and make adjustments as needed. Some highlights include:


Detailed Data Capture -- Pepperdata for EMR presents granular details, both real-time and historical. To present insights, Pepperdata gathers more than 300 metrics second-by-second across hardware, workloads and apps. The approach reveals valuable ‘big picture’ views into EMR tasks, including performance, capacity, job duration and more, Suchter told us.


While Pepperdata’s ability to capture so much real-time data is eye-catching, Suchter said that the historical data may prove even more valuable to EMR users. “Because Amazon EMR clusters tend to be ephemeral or short-lived, once a run is complete the cluster terminates, taking all performance data along with it,” Suchter noted. “That’s the main reason EMR has been such a ‘black box,’” he added. In other words, once the EMR cluster disappears, all the data about its performance disappears with it.


Visibility – To present the data in easy-to-digest ways, Pepperdata also provides a graphical dashboard, with views that help users quickly identify and troubleshoot problems in EMR workloads, cluster inefficiencies and even opportunities for code-level improvement.


Control – Pepperdata for EMR is not simply an X-ray view into EMR; it also aims to deliver control and action, Suchter added. Under the covers, Pepperdata communicates with agents that run on every data node in the cluster. Based on the metrics collected on CPU, memory, disk I/O, and network resources (by container/task, job, user, and group), Pepperdata can dynamically optimize the usage of those resources. Further, it enables administrators to implement policies that guarantee the completion of high-priority jobs while maintaining the cluster at peak performance.
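
To make the idea concrete, here is a minimal Python sketch of rolling per-container readings up by job or user and then handing out CPU in priority order. It is an illustration only and does not reflect Pepperdata's actual agents, data model or policy engine; the `Sample` fields, the job names, the priority values and the `cpu_shares` rule are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One hypothetical per-container reading for a one-second interval."""
    job: str
    user: str
    cpu_cores: float
    memory_mb: float


def usage_by(samples, key):
    """Sum CPU and memory per distinct value of `key` ("job" or "user")."""
    totals = defaultdict(lambda: {"cpu_cores": 0.0, "memory_mb": 0.0})
    for s in samples:
        bucket = totals[getattr(s, key)]
        bucket["cpu_cores"] += s.cpu_cores
        bucket["memory_mb"] += s.memory_mb
    return dict(totals)


def cpu_shares(demand, priority, capacity):
    """Grant CPU to jobs in descending priority until capacity runs out.

    A deliberately simple stand-in for a "high-priority jobs first" policy.
    """
    grants = {}
    remaining = capacity
    for job in sorted(demand, key=lambda j: priority.get(j, 0), reverse=True):
        grants[job] = min(demand[job], remaining)
        remaining -= grants[job]
    return grants


# Hypothetical readings from three containers in the same second.
samples = [
    Sample("etl-nightly", "alice", 2.0, 4096),
    Sample("etl-nightly", "alice", 1.5, 2048),
    Sample("adhoc-query", "bob", 3.0, 1024),
]
by_job = usage_by(samples, "job")
demand = {job: totals["cpu_cores"] for job, totals in by_job.items()}
grants = cpu_shares(
    demand,
    priority={"etl-nightly": 10, "adhoc-query": 1},
    capacity=5.0,
)
print(grants)
```

In this toy case the high-priority job receives its full demand, and the ad hoc query gets whatever capacity is left over.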


Simple to Start and Operate – Perhaps best of all, Pepperdata has engineered its solution so one doesn’t need a Hadoop PhD to install or use it. Users can activate the service with a simple “one-line configuration change,” he said. With the Pepperdata for EMR one-click install, joint customers gain instant, granular visibility into their clusters’ run-time performance, Suchter said.


Early Pepperdata for EMR Customers Report Several Benefits

We asked Suchter to detail how some early customers are benefiting from Pepperdata for EMR.


Real-Time Alerts -- “We’ll show you that your job broke right at the time when it breaks, and we’ll show you where,” Suchter said. Pepperdata for EMR sends real-time alerts in the cluster, while it is actually running. “Today, without Pepperdata, if something is broken in your EMR workload, you only find out by the time it’s supposed to be done. You look for it and see that it broke. You don’t get any precursors.”


Historical Analysis -- Beyond tapping into valuable real-time EMR metrics, Suchter said that Pepperdata for EMR’s ability to capture historical data and run comparisons against it may prove even more valuable to customers. “Even after an Amazon EMR cluster has completed its work and terminated, users will be able to access fine-grained monitoring data that allows customers to view a run and analyze it, as well as compare it with historical data to improve future performance.”


Faster Jobs = Lower Costs – Pepperdata also cuts the time to run an EMR job, which will almost always translate to cutting costs – almost immediately. One customer consistently required 17 hours to complete their EMR jobs. After just one day of analysis by Pepperdata, the customer was able to shave that time down to 4 hours. “That improvement with one job will result in what we think will be annual savings of about a half-million dollars – and that was just after the first days,” Suchter said.
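
For a sense of scale, the 17-hour and 4-hour figures above can be worked through in a few lines of Python. The helper name `runtime_change` is ours, not Pepperdata's. The article gives no cluster size or hourly price, so this sketch covers only the run-time arithmetic and makes no attempt to reproduce the half-million-dollar estimate.

```python
def runtime_change(before_hours, after_hours):
    """Return (speedup factor, fraction of run time eliminated)."""
    speedup = before_hours / after_hours
    reduction = 1 - after_hours / before_hours
    return speedup, reduction


speedup, reduction = runtime_change(17, 4)
print(f"{speedup:.2f}x faster, {reduction:.0%} less run time")
# prints "4.25x faster, 76% less run time"
```

Wherever billing tracks cluster hours, a cut of roughly three quarters in run time should carry through to a similar cut in the cost of that job.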


Inside Pepperdata EMR Technology, 2017 Roadmap

Pepperdata for EMR is based on the company’s Adaptive Performance Core technology, which observes and reshapes applications’ usage of CPU, RAM, network, and disk – all without user intervention. Thanks to Pepperdata metrics, coupled with smart automation, the solution is engineered to dynamically prevent bottlenecks in multi-tenant, multi-workload clusters and to ensure that jobs use available resources efficiently and complete on time – and often much faster than without Pepperdata, he told us.


Pepperdata has even more plans to improve Amazon EMR operations.


The company has begun beta availability of its Adaptive Scaling for Amazon EMR. The offering will let customers gain even more control over their EMR jobs and expenses with little more than filling out a form. A user simply specifies a time or budget for their job completion. Pepperdata will take that information and automatically purchase elastic instances on Amazon EMR, growing or shrinking the cluster as needed.
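
The article does not describe how Adaptive Scaling turns a deadline or budget into an instance count. As a rough illustration only, the Python sketch below assumes the job's total work is known in node-hours and divides perfectly across nodes, which real jobs rarely do. The function names, the 68 node-hour workload and the $0.50 hourly price are all hypothetical.

```python
import math


def plan_for_deadline(work_node_hours, deadline_hours, price_per_node_hour):
    """Pick the smallest node count that meets the deadline.

    Assumes perfectly parallel work, so total cost depends only on the
    amount of work and not on how many nodes share it.
    """
    nodes = math.ceil(work_node_hours / deadline_hours)
    run_hours = work_node_hours / nodes
    cost = work_node_hours * price_per_node_hour
    return {"nodes": nodes, "run_hours": run_hours, "cost": cost}


def fits_budget(plan, budget):
    """Check a plan's estimated cost against a user-supplied budget."""
    return plan["cost"] <= budget


# Hypothetical job: 68 node-hours of work, 4-hour deadline, $0.50 per node-hour.
plan = plan_for_deadline(work_node_hours=68, deadline_hours=4, price_per_node_hour=0.5)
print(plan, fits_budget(plan, budget=40))
```

A production scaler would also have to account for imperfect parallelism, billing granularity and instance availability, none of which this sketch models.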


Pepperdata for EMR is free to customers through Dec 31.


Pepperdata is accepting sign-ups for the beta of its Adaptive Scaling for Amazon EMR now. The product is slated to be publicly available in Q1 2017.