Spark up the Integration of Zeppelin and Livy

Author: Colton Rodgers, Infrastructure Engineer | HDPCA, HDPCD

Overview

In-memory computations for machine learning and data science are a large part of business processes and decisions today. In this age of information, we are always striving for faster response times, better performance, better results, and more data. Using the Apache tools Spark, Livy, and Zeppelin together can be a powerful way to tackle these seemingly insurmountable tasks.

Zeppelin

Apache Zeppelin is a web-based notebook for interactive, data-driven analytics. Notebooks can be shared for collaboration and support multiple languages, including SQL, Python, SparkSQL, Scala, and R. A strong point of emphasis for Zeppelin is its interactivity on long workflows, which allows code to be changed on the fly. Another area to highlight is its “Modern Data Science Studio,” which supports Spark and Hive straight out of the box and allows for exploration, reporting, and visualization of Spark and Hive data.

Spark

Spark is a powerful and versatile in-memory processing tool. It is quite often coupled with the Hortonworks Data Platform, but it also stands alone as an industry-leading engine, depending on business needs. Spark boasts run times up to 100 times faster than Hadoop MapReduce when working in memory, and up to 10 times faster when working on disk. Spark is also very flexible and allows SQL and streaming analytics to be combined.

Livy

Livy allows Zeppelin to become a programmatic, fault-tolerant, and multi-tenant submission system for Spark jobs. This way, many different users can interact with your Spark cluster at the same time with a high degree of reliability. Livy accepts code in either Scala or Python, so clients can talk to a Spark cluster in either of the two languages.
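
To make the idea of programmatic submission a little more concrete, here is a minimal sketch in Python of the kind of requests a client might prepare for Livy's REST interface: one to open an interactive session in a chosen language, and one to submit a snippet of code to that session. The server address and the helper names (LIVY_URL, build_request, new_session_request, statement_request, send) are assumptions made for illustration and are not taken from the deployment described in this post. Zeppelin's Livy interpreter handles this exchange for you, so treat this as a rough picture rather than something you would need to write yourself.

```python
import json
import urllib.request

# Assumption: a Livy server is reachable here (8998 is Livy's default port).
LIVY_URL = "http://localhost:8998"


def build_request(path, payload):
    """Build (but do not send) a JSON POST request for the Livy REST API."""
    return urllib.request.Request(
        url=LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def new_session_request(kind="pyspark"):
    """Prepare a request that asks Livy to start an interactive session.

    kind is "spark" for a Scala session or "pyspark" for a Python session.
    """
    return build_request("/sessions", {"kind": kind})


def statement_request(session_id, code):
    """Prepare a request that submits a code snippet to an existing session."""
    return build_request(f"/sessions/{session_id}/statements", {"code": code})


def send(request):
    """Send a prepared request and decode the JSON reply.

    This is the only function here that needs a running Livy server.
    """
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server available, a client would typically send the session request first, read the session id from the reply, wait for the session to become idle, and then send statement requests against that id. Keeping request construction separate from sending makes it possible to inspect the payloads without a cluster.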

Deployment and Execution

I’ve recently finished an implementation of these products for a client that is aiming to detect fraud in real time in its line of business. Using these three products together allows the client's teams to collaboratively create scripts and models that detect suspicious activity more closely.

This process pulls in multiple gigabytes of data and processes it quickly and efficiently, so the client receives results much sooner. That allows for timely changes to code and quicker reactions to findings, which ultimately results in a more effective business.

What can this combination of Apache tools do for you and your company?