How Apache Drill Could ‘Democratize’ Big Data: Simplicity, Self-Service & Lower Cost

For big data adopters, Spring 2015 was a sweet season of simplicity. One big example: Apache Drill lets non-technical users explore huge and varied sets of data on the fly – and avoid spending tons of time and money on data prep by IT. To see how Drill could democratize big data, IDN talks with MapR Technologies’ CMO Jack Norris.

For big data adopters, Spring 2015 truly was the sweet season of simplicity. Amidst the wave of new technologies and commercial product updates, the debut of open source project Apache Drill 1.0 deserves attention.


Apache Drill 1.0 sports new technologies that will empower non-technical users to explore massive amounts of data without tons of data prep. Drill was inspired by Google's research paper “Dremel: Interactive Analysis of Web-Scale Datasets.” The Apache Software Foundation said Drill was designed to scale to 10,000 servers (or more) and to process petabytes of data and trillions of records in seconds.


Taken all together, Drill adds up to big benefits for big data. As an ASF press release put it, “Apache Drill revolutionizes data exploration and analytics.”


To truly understand its impact, one has to go under the covers of Apache Drill. It brings together four interrelated tech breakthroughs to try to blast through some of the toughest big data barriers facing IT departments today:

Flexibility. A schema-free SQL query engine lets Drill work with (or without) Apache Hadoop. To run queries with Hadoop, users connect Drill to Hive, HBase, or distributed file system data sources. Drill also runs with NoSQL (MongoDB, HBase) and cloud storage (Amazon S3, Google Cloud Storage, Azure Blob Storage, Swift). Drill can even explore data without Hadoop, running on a Linux, Mac OS X or Windows laptop.


Schema-Freedom. Drill is different from traditional SQL-on-Hadoop solutions (Hive, Impala). Because it uses a JSON data model for self-describing documents and data, users can query without creating or managing schemas. Drill also processes all these data types in situ, so native capabilities of these datastores aren’t compromised. Drill can even handle data with evolving schema or no schema (e.g. JSON files, MongoDB collections, HBase tables). It also can intermingle any data – structured, semi-structured or unstructured (e.g. IoT, sensors, click-streams, etc.).


Simplicity & Speed. Drill’s flexibility doesn’t come at the cost of performance. In fact, it explores data on the fly, again thanks to its JSON data model, which handles self-describing data sources. A Drill query is automatically compiled and re-compiled during the execution phase, based on the actual data flowing through the system. This approach avoids the time-consuming task of having to guess at schemas using rules or sampling.


The Road to Self-Service Analytics. Drill delivers self-service SQL analytics – without requiring pre-defined schema definitions. It also comes with native support for popular business intelligence (BI) and data visualization tools. Drill connects to ODBC-compliant BI tools using standard ODBC connectors.
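
As a rough illustration of what querying a raw file with no pre-defined schema can look like, here is a minimal Python sketch. It assumes a Drill instance exposing its REST endpoint at localhost:8047/query.json (the default web port in Drill's documentation; availability may vary by release). The file path /data/clicks.json, the field names, and the helper functions are hypothetical and not taken from the article.

```python
import json
import urllib.request

# Assumption: a Drill node's REST endpoint is reachable at this address.
DRILL_URL = "http://localhost:8047/query.json"


def build_drill_request(sql, url=DRILL_URL):
    """Wrap a SQL string in the JSON envelope posted to Drill's REST endpoint."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, url=DRILL_URL):
    """Send the query and return the decoded JSON reply (needs a running Drill)."""
    with urllib.request.urlopen(build_drill_request(sql, url)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical raw JSON file: no table or schema was declared for it beforehand.
EXAMPLE_SQL = "SELECT t.user_id, t.event FROM dfs.`/data/clicks.json` t LIMIT 10"
```

With a Drill instance running, calling run_query(EXAMPLE_SQL) would submit the query directly against the file; the sketch deliberately makes no network call on its own.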

Under the Covers with Apache Drill 1.0 – Attacking Data’s ‘New Reality’

The Apache Drill project was born in 2012 out of the tremendous growth of non-relational, unstructured and streaming data and the increased adoption of schema-free datastores. This shift away from traditional RDBMS cried out for a new data architecture, according to Apache engineers.


“The architecture of relational query engines and databases is built on the assumption that all data has a simple and static structure that’s known in advance, and this 40-year-old assumption is simply no longer valid,” said Jacques Nadeau, vice president of Apache Drill, in a statement.

“We designed Drill from the ground up to address the new reality.”


To meet data’s ‘new reality,’ Drill sports some key design features:

  • Introducing the new JSON document model to the world of SQL-based analytics and BI. “This enables users to query fixed-schema, evolving-schema and schema-free data stored in a variety of formats and datastores,” Nadeau said.
  • An innovative “columnar execution” engine, designed expressly to support complex and schema-free data.
  • For speed, Drill’s execution engine performs data-driven query compilation (and re-compilation, aka schema discovery) during query execution. An “optimizer” can exploit Apache Parquet's columnar storage for maximum performance.
  • To (finally) deliver on promises for ‘self-service’ by non-technical users, Drill sports a new UI, dubbed Drill Explorer. Apache describes how Drill Explorer works: Drill Explorer is . . . for browsing Drill data sources, previewing the results of a SQL query, and creating a “view.” Typically, you use Drill Explorer to explore data or to create a view that you can query as if it were a table. For example, before designing a report using a BI reporting tool, use Drill Explorer to quickly familiarize yourself with the data. In an ODBC-compliant BI tool, use the ODBC DSN to create an ODBC connection.

Drill Explorer also connects Drill to Hive, HBase, Parquet, JSON, CSV or TSV files.
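
The data-driven compilation and re-compilation mentioned in the list above can be pictured with a toy Python sketch. This is only an illustration of the idea, not Drill's actual engine: the function names are invented here, and "compiling" is reduced to counting how often the observed schema changes between batches of records.

```python
def batch_schema(batch):
    """Return the sorted (field, type-name) pairs observed in one batch of records.

    Toy simplification: if a field appears with several types in the same
    batch, the last type seen wins.
    """
    fields = {}
    for record in batch:
        for name, value in record.items():
            fields[name] = type(value).__name__
    return tuple(sorted(fields.items()))


def count_compilations(batches):
    """Count the initial compile plus one re-compile per schema change."""
    compiled_for = None
    compilations = 0
    for batch in batches:
        schema = batch_schema(batch)
        if schema != compiled_for:
            # The data no longer matches what the plan was built for.
            compiled_for = schema
            compilations += 1
    return compilations


# Hypothetical sensor records: the third batch introduces a new "unit" field.
batches = [
    [{"id": 1, "temp": 20.5}, {"id": 2, "temp": 21.0}],
    [{"id": 3, "temp": 19.8}],
    [{"id": 4, "temp": 22.1, "unit": "C"}],
]
print(count_compilations(batches))  # prints 2: one initial compile, one re-compile
```

The point of the sketch is that the schema is learned from the records as they arrive rather than guessed in advance from rules or samples.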


The Move To Democratizing Big Data – Apache Drill + MapR + Others

From its inception, MapR Technologies has been a big supporter of the Apache Drill project. In fact, MapR believes so strongly in Drill that last month it became the first Hadoop distro to ship Drill pre-built with its offering.


A main driver for the company’s involvement was the push to make big data projects easier and to cut big data’s notoriously long time-to-value, MapR’s CMO Jack Norris told IDN.


“Before Drill, the key to big data success used to be a lot of prep, define the schema correctly and even after all that – do more analysis to figure out what are the questions you’re going to ask ahead of time,” Norris said. “Now, with Drill, you don’t have to define the data in a certain way or in a certain format. You don’t even know the queries you’re going to ask. You can just go right to your data directly – it’s a huge change.”


Even more than removing a lot of the cost, complexity and delay, Drill is setting the stage for something more exciting: an era of self-service analytics – even by non-technical users, Norris said. “Drill is really exciting because it addresses the important last mile of self-service.”

IDN asked him if his conclusion (and enthusiasm) is based on ‘real architecture’ or ‘marketecture’ (aka wishful thinking + PowerPoint slides). He had a good response.


“So, self-service has been thrown around as a term for quite some time. But it usually meant something like, ‘Look you don’t have to have a developer design the report or query – you can do it on your own,’” Norris told IDN. “But that [definition] assumes the data they need has already been prepped and is available. With Drill, you don’t have to wait for IT to prep that data. You are going directly against that raw file,” he said.


To show that Drill brings ‘self-service analytics’ within reach, Norris shared a story from a MapR customer visit, where his team was doing a demo.


“Last week, we had a sales engineer demo with Drill at a client site. They liked what they saw, and then said to our team, ‘Why don’t you use our data [for the demo]?’ So, we pointed Drill at their data and did exploration directly on their data right then. It just blew them away,” Norris told IDN. The story also shows how Drill is stomping out data prep. “With Drill, data is available from the time it is created, without IT preparation,” he added.


Enterprise-Class Big Data Benefits of Apache Drill + MapR Distro

With MapR’s decision to bring Drill into its enterprise-grade Hadoop platform, users gain all the quick-start and easy-use features of Drill – and can infuse them with many of MapR’s mission-critical and real-time platform features.


Indeed, MapR’s core Hadoop distro features will work with Drill. They include support for low-latency clustering, even at massive scale; live updating of clusters; high availability; disaster recovery; resource management (via YARN); and security controls along with high-speed, wire-level encryption (for data sent between nodes). In fact, Drill sports its own approach to security, which complements MapR’s security features.


“Traditional Hadoop provides fine-grained permission to protect data accessed through Hive. But, users could employ another tool to circumvent permission protections,” Norris said. For its part, Drill provides granular control through its ‘views,’ by making the view dependent upon role-based permissions.


“Files are locked down, and can only [be accessed] through a view,” he said. Drill offers security controls for authentication and for row- and column-level access to distributed data. To enforce the security of these ‘views,’ Drill integrates with existing enterprise directory services (LDAP, Active Directory, etc.), Norris added.
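
To make the view-based approach more concrete, the following Python sketch composes a Drill-style CREATE VIEW statement that exposes only selected columns of an underlying file. The workspace dfs.tmp, the file path, the view name and the column names are all hypothetical, and the helper is an illustration only: it does no input sanitizing, and the permission enforcement Norris describes (locked-down files, role checks, directory services) happens in the platform, outside this sketch.

```python
def create_view_sql(view_name, source, columns):
    """Build a CREATE VIEW statement that exposes only the listed columns."""
    if not columns:
        raise ValueError("a view must expose at least one column")
    # Backticks quote identifiers in Drill's SQL dialect.
    column_list = ", ".join(f"t.`{c}`" for c in columns)
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS "
        f"SELECT {column_list} FROM {source} t"
    )


# Hypothetical example: analysts see order_id and region, nothing else.
sql = create_view_sql(
    "dfs.tmp.`orders_public`",
    "dfs.`/data/orders.json`",
    ["order_id", "region"],
)
print(sql)
```

Users granted access to the view, but not to the raw file, would then see only the columns the view selects.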


With Drill’s focus on simplicity, speed and self-service, we asked Norris about Drill’s impact on the big data / Hadoop sector. “Drill will change the paradigm of how people think about a big data project. The technology basically lets an analyst point at this data source or that data source or multiple ones – and explore the data directly, in ways they couldn’t before. Once people grasp it, Drill is really going to be game-changing.”


Indeed, the chorus of companies singing Drill’s praises (and contributing to the project) already spans a broad spectrum. It includes Information Builders, Jinfonet Software, MicroStrategy, Qlik, Simba, Tableau and TIBCO.


Apache Drill 1.0 is available for download from the Apache Drill project site and is released as open source under the Apache License v2.0.