MapR Embraces Apache Spark To Ease ‘Real-Time’ Analytics with Hadoop

MapR Technologies wants to make it easier for Hadoop to deliver real-time analytics by adding full Apache Spark support to its MapR Hadoop distribution. To learn how full Spark support will help devs and IT operations, IDN talks with MapR’s Jack Norris.

Tags: Apache, analytics, big data, Hadoop, MapR, MapReduce, real-time, Spark, streams, structured, unstructured


MapR Technologies wants to make it easier for Hadoop to deliver real-time analytics. Its Hadoop distribution now supports the full Apache Spark in-memory technology stack.


MapR is adding Apache Spark support as Hadoop adopters take on projects that apply Hadoop to real-time use cases, according to MapR’s chief marketing officer Jack Norris.

“We’re seeing more and more companies looking to combine analytics with real-time operational workloads,” Norris told IDN. “But, it can be complicated to bring all the pieces needed all together in Hadoop alone. Full support for Apache Spark dramatically simplifies real-time projects – for both operations and developers.”


At least one analyst also noted the uptick in popularity for Apache Spark. “It has become clear that Apache Spark offers a combination of high-performance, in-memory data processing and multiple computation models that is well suited to serving as the basis of next-generation data processing platforms,” said 451 Research’s research director Matt Aslett in a statement.


Delivering real-time insights in Hadoop can be easier said than done, mainly because it typically requires bringing together “multiple pieces,” Norris explained.


Among those pieces, he said, are:

  1. The ability to do deep analytics using multiple data types
  2. Support for massive clustering operations at low latency
  3. A capability to bring in streaming data and combine it with other data to discern its meaning and derive insights from all these data sources
  4. An environment that triggers intelligent real-time responses to what the data is uncovering, whether the result was anticipated at the start or was something unexpected

MapR’s approach to Apache Spark support removes barriers to bringing these pieces together for IT operations and dev teams, according to Norris.


For IT operations, the Apache Spark engine can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk, according to the Apache Spark community site. MapR’s support leverages Spark to overcome a trait in MapReduce that can make getting real-time results practically impossible.


Norris described the trait this way: “In MapReduce, there is a construct that assumes there is a ‘batch’ process, and so every result needs to be sent to disk,” which dramatically slows outcomes. In addition, Apache Spark’s advanced DAG execution engine supports cyclic data flow and in-memory computing.


For dev teams, Apache Spark offers more than 80 high-level operators to simplify the design of parallel apps. Just how easy? Spark jobs can require as few as one-fifth the lines of code. Spark also provides a simple programming abstraction so devs can design apps as operations on data collections (known as RDDs, or resilient distributed datasets). Devs can build in Java, Scala and Python, and the same code they write can be reused across batch, interactive and streaming applications.
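
To make the "operations on data collections" idea concrete, here is a minimal sketch in plain Python rather than in Spark itself. It mimics the shape of a word count written against an RDD, where each step is a single operation on a whole collection. The function name `word_count` and the sample lines are hypothetical. The comments note the Spark RDD operators (`flatMap`, `map`, `reduceByKey`) that each step loosely corresponds to. The sketch runs on one machine and says nothing about Spark's performance.

```python
from collections import Counter
from itertools import chain


def word_count(lines):
    """Count words in an iterable of text lines, one collection-level step at a time."""
    # Split every line into words and flatten the result (loosely like RDD.flatMap).
    words = chain.from_iterable(line.lower().split() for line in lines)
    # Pair each word with a count of 1 (loosely like RDD.map).
    pairs = ((word, 1) for word in words)
    # Sum the counts for each word (loosely like RDD.reduceByKey).
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


# Made-up sample data, for illustration only.
sample_lines = ["spark on hadoop", "hadoop at scale", "spark streaming"]
result = word_count(sample_lines)
print(result)
```

In Spark the same three steps would be chained as operators on a distributed collection, with the engine deciding how to spread the work across a cluster.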


“There are other efficiencies and re-use,” Norris added. “A dev can use the same algorithm in Spark and apply it to batch, streaming and interactive process. Each would have to be done separately prior to Spark.”
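
The reuse Norris describes can be sketched without Spark: write the logic once as a function over a collection of records, then call it on a full batch or on each small batch arriving from a stream. The names below (`total_by_region`, `merge_totals`, the sample orders) are hypothetical, and the list of micro-batches merely stands in for a stream. Spark's own APIs also handle distribution and fault tolerance, which this sketch does not attempt.

```python
from collections import defaultdict


def total_by_region(orders):
    """Sum order amounts per region for any iterable of (region, amount) pairs."""
    totals = defaultdict(float)
    for region, amount in orders:
        totals[region] += amount
    return dict(totals)


def merge_totals(running, update):
    """Fold one micro-batch result into the running totals."""
    merged = dict(running)
    for region, amount in update.items():
        merged[region] = merged.get(region, 0.0) + amount
    return merged


# Made-up sample data, for illustration only.
orders = [("east", 10.0), ("west", 5.0), ("east", 2.5), ("west", 1.5)]

# Batch: one call over the full data set.
batch_totals = total_by_region(orders)

# Streaming stand-in: the same function applied to each micro-batch as it arrives.
micro_batches = [orders[:2], orders[2:]]
stream_totals = {}
for micro_batch in micro_batches:
    stream_totals = merge_totals(stream_totals, total_by_region(micro_batch))

print(batch_totals, stream_totals)
```

Both paths call the same `total_by_region` function, so the streaming totals end up matching the batch totals once every micro-batch has been processed.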


As for where Apache Spark is having the biggest impact, Norris said the technology stack is proving especially valuable in simplifying real-time “machine-learning workloads.” “This developer community is very enamored with Spark, and they think it’s the future. So, MapR is happy to provide them full support for Spark and wrap it with our enterprise-class features [for Hadoop] such as data protection, data recovery and high availability,” he told IDN.



MapR’s distro does not require people to use Apache Spark, Norris said. It is simply the company’s latest support option.


“We remain open to supporting a wide number of projects. Spark is just the latest,” Norris told IDN. On that point of openness, the MapR Hadoop distro now includes more than 20 Apache open source projects, Apache Spark among them, with support for batch, interactive, streaming, graph and machine learning.

Users can expect ongoing partnerships between MapR and innovative open source communities, according to MapR’s founder and CEO John Schroeder. “The open source community is developing tremendous technology innovations at a rapid pace. MapR provides a future-proof investment for our customers with the most open distribution to give them flexibility to pick the right solution with the widest range of compute frameworks and libraries,” Schroeder said in a statement.


MapR also announced that bookings for its MapR Hadoop distro between January and March 2014 tripled compared with the same period a year earlier. It also released a tight integration with HP Vertica’s high-performance analytics platform, which provides interactive SQL-on-Hadoop and expanded capability to explore semi-structured data and traditional structured data.