Tuesday, November 15, 2016


Hadoop is used extensively in industry to analyze large data sets, because the Hadoop framework is built on a simple programming model called MapReduce and it enables a computing solution that is scalable, flexible, fault-tolerant and cost-effective.

Apache Spark is a fast, general-purpose engine for large-scale data processing that can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Apache Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing. Spark offers over 80 high-level operators that make it easy to build parallel apps, and you can use it interactively from the Scala, Python and R shells.
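
As a taste of what those operators look like, here is a minimal word-count sketch in Scala. It assumes a SparkContext is already available as sc (as it is in the interactive shell) and a hypothetical input.txt file:

val lines = sc.textFile("input.txt")                           // read the file as an RDD of lines
val words = lines.flatMap(line => line.split(" "))             // split each line into words
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)   // count occurrences of each word
counts.take(10).foreach(println)                               // print ten (word, count) pairs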

This tutorial explains the steps to install Spark on your system:

Verifying and updating Java Installation

Installing Java is the first step in installing Spark. Run the following command to verify your Java version:

$ java -version
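
If Java is already installed, this prints something like the following (the exact version and build numbers will differ on your machine):

java version "1.8.0_71"
Java(TM) SE Runtime Environment (build 1.8.0_71-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.71-b15, mixed mode)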

Update your default Java alternatives so that Java 8 is referenced for all applications. In case you do not have Java installed on your system, install Java before proceeding to the next step.
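
On Debian/Ubuntu-based systems, for example, you can switch the default interactively with the alternatives tool (this assumes a Java 8 package is already installed):

$ sudo update-alternatives --config java
$ sudo update-alternatives --config javac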

Install Scala

Spark is written in Scala, so we need to install Scala to build Spark. Download the latest stable version of Scala.
The following commands will download it and place it in the right directory:

$ wget http://www.scala-lang.org/files/archive/scala-2.10.6.tgz
$ sudo mkdir -p /usr/local/src/scala
$ sudo tar xvf scala-2.10.6.tgz -C /usr/local/src/scala/

Go to the end of your “~/.bashrc” file and add the following lines:

export SCALA_HOME=/usr/local/src/scala/scala-2.10.6 
export PATH=$SCALA_HOME/bin:$PATH

Reload the “~/.bashrc” file:

$ . ~/.bashrc

Check if Scala is installed successfully by running the following command:

$ scala -version
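
For the version installed above, the output should look along these lines:

Scala code runner version 2.10.6 -- Copyright 2002-2013, LAMP/EPFL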

After verifying the Scala version and installation, you can proceed to the next step.

Downloading and installing Apache Spark

Download the latest version of Spark. Once the download completes, you will find the Spark tar file in your Downloads folder.
The following commands extract the archive and move the Spark files to their own directory (/usr/local/spark):

$ su -
Password: 

# cd /home/Hadoop/Downloads/
# tar xvf spark-1.3.1-bin-hadoop2.6.tgz
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit 

Add the following line to your ~/.bashrc file. This adds the location of the Spark binaries to the PATH variable.

export PATH=$PATH:/usr/local/spark/bin

Use the following command to source the ~/.bashrc file:

$ source ~/.bashrc 
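
To confirm that the Spark binaries are now on your PATH, you can ask the shell where it resolves the command from; given the install location above, it should report:

$ which spark-shell
/usr/local/spark/bin/spark-shell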

Verifying the Spark Installation

Start the Spark shell by running the following command:

$ spark-shell

If everything went well, the following screen will show up, validating a successful installation:

Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled;
   ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.0
      /_/ 
 
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc 
scala>  

You can now run all your Spark commands from this shell to do all the magic!
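
For example, here is a quick sanity check run inside the shell, using the SparkContext it created as sc. The sum of the numbers 1 to 1000 is 500500, so the result is easy to verify:

scala> sc.parallelize(1 to 1000).sum()   // distribute the range across workers and add it up
res0: Double = 500500.0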