Spark learning journey - Part 1


I need to get familiar with Spark. This is both because I will need it in my new project and because it is quite a successful platform these days, and those of us working on data-related projects need to know the ins and outs of cluster processing systems.

I'll start a series of notes from my learning curve; this is the first post. I hope some of you who are new to this platform will find them useful, so I'll write them for beginners, simply as my notes.

First, you should try to get an overall understanding of what Spark is and which industry problems it solves. Among the many good resources I browsed online, I would say this link provides a good overview that is easy to read and understand.

Step 1 - Document yourself

To document myself, the first thing I did was read the book Big Data SMACK: A Guide to Apache Spark, Mesos, Akka, Cassandra, and Kafka by Raul Estrada and Isaac Ruiz.

I read the book and practiced some of the examples, especially the ones related to RDD transformations and actions. While reading, I tried to understand some concepts more deeply by browsing online.
One small inconvenience for me was that most of the sample code was in Scala, which I was not used to. Anyway, I used this experience to get a very high-level feeling for Scala as well.


Step 2 - Install the platform

This is a must even for step 1, as you need to practice.
Download the platform kit from https://spark.apache.org/downloads.html and install it on your local machine.
I used Windows as I found it easier, given that my office OS is also Windows.

Installing it locally just means extracting the contents of the downloaded archive; that is all. You then need to get familiar with the contents of the bin directory.

The most used CLI commands at the beginning, for the purpose of learning, are spark-shell and spark-submit.

./spark-shell - Spark ships with an interactive shell (a Scala prompt, since Spark is developed in Scala). In the interactive shell we can run commands (transformations and actions) to process data.
./spark-submit - used to launch applications on a cluster. It can use all of Spark’s supported cluster managers through a uniform interface, so you don’t have to configure your application specially for each one. More details on the spark-submit parameters are available on the original Spark documentation page.

Step 3 - Build your first app and run it locally

Many online materials, as well as the SMACK book mentioned above, show examples using Maven. I could not use Maven in my first attempt to learn Spark, as I had some permission issues in the office which I did not have the time to solve. As a result, I chose to build a simple new project by manually adding the dependencies when creating the jar to be passed to spark-submit.
The code that I first played with is taken from this well-documented blog, and it works with RDDs and joins. It also contains references to Hadoop, but I did not want to use it. I wanted the bare minimum related to Spark, so I changed the code a little bit.

This example is quite powerful because it reflects a real-life scenario: you receive two input files, one with the purchase transactions of products and one with the customers (users), and you are asked for an analysis result, such as the number of unique locations in which each product has been sold.
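
To make the analysis concrete, here is a minimal plain-Java sketch of the same logic, without Spark. It is not the code from the blog mentioned above: the class name, the method name and the two-column row layouts (product id and user id for a transaction; user id and location for a user) are assumptions for illustration only. The Spark version expresses the lookup as a join between two RDDs, while this sketch uses ordinary maps and sets.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, not part of Spark or of the original blog code.
class UniqueLocationsSketch {

    // transactions: rows of {productId, userId} (assumed layout)
    // users:        rows of {userId, location}  (assumed layout)
    static Map<String, Integer> countLocationsPerProduct(
            List<String[]> transactions, List<String[]> users) {

        // Index users by id so each transaction can look up its location.
        Map<String, String> locationByUser = new HashMap<>();
        for (String[] user : users) {
            locationByUser.put(user[0], user[1]);
        }

        // "Join" each transaction with its user and collect distinct locations.
        Map<String, Set<String>> locationsByProduct = new TreeMap<>();
        for (String[] tx : transactions) {
            String location = locationByUser.get(tx[1]);
            if (location == null) {
                continue; // unknown user: dropped, as in an inner join
            }
            locationsByProduct
                    .computeIfAbsent(tx[0], k -> new HashSet<>())
                    .add(location);
        }

        // Reduce each set of locations to its size.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : locationsByProduct.entrySet()) {
            result.put(e.getKey(), e.getValue().size());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> transactions = Arrays.asList(
                new String[]{"p1", "u1"},
                new String[]{"p1", "u2"},
                new String[]{"p1", "u3"},
                new String[]{"p2", "u1"});
        List<String[]> users = Arrays.asList(
                new String[]{"u1", "US"},
                new String[]{"u2", "GB"},
                new String[]{"u3", "US"});

        // prints {p1=2, p2=1}
        System.out.println(countLocationsPerProduct(transactions, users));
    }
}
```

Product p1 was bought by three users but in only two distinct locations, which is why the set (the counterpart of a distinct step) matters before counting.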

Here is my modified version of the code, but I recommend that you read all the comments and explanations given by the original author.

To run it, I built a jar, first creating the manifest.mf file that references the three dependency jars placed in the lib directory.

jar cvfm sparkExample.jar manifest.mf lib *.*

manifest.mf

Main-Class: ExampleJob
Class-Path: lib/spark-core_2.11-2.3.1.jar lib/scala-library-2.11.8.jar lib/hadoop-mapreduce-client-core-2.7.3.jar


Later I read in the official documentation that, when creating assembly jars, you should list Spark and Hadoop as provided dependencies; they need not be bundled since they are provided by the cluster manager at runtime. Once you have an assembled jar, you can call the bin/spark-submit script as shown here while passing your jar.

When the code is run, the results are printed to the console.

spark-submit --class ExampleJob --master local ./sparkExample.jar transactions.txt users.txt dir




So now we have had a very high-level introduction to the Spark engine and run our first application locally (with --master local, Spark runs inside a single JVM on your machine rather than on a real cluster).

There is still much more to cover in future posts ;)
