Hadoop is used extensively across industries to analyze large data sets, as the Hadoop framework is based on a simple programming model called MapReduce and enables a computing solution that is scalable, flexible, fault-tolerant and cost-effective.
Apache Spark is a fast and general engine for large-scale data processing that runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Apache Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing. Spark offers over 80 high-level operators that make it easy to build parallel apps, and you can use it interactively from the Scala, Python and R shells.
This tutorial explains the steps to install Spark on your system:
Verifying and Updating Java Installation
Installing Java is the first step in installing Spark. Run the following command to verify the Java version:
$ java -version
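If Java is already installed, you should see output similar to the following (the exact version and build numbers will vary on your machine):
java version "1.8.0_71"
Java(TM) SE Runtime Environment (build 1.8.0_71-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.71-b15, mixed mode)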
Update the default Java alternatives on your system so that Java 8 is referenced for all applications. If you do not have Java installed on your system, install Java before proceeding to the next step.
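For example, on Debian/Ubuntu systems (an illustrative sketch; package names and the alternatives mechanism vary by distribution), you could install OpenJDK 8 and select it as the default:
$ sudo apt-get install openjdk-8-jdk        # install OpenJDK 8 (Debian/Ubuntu package name)
$ sudo update-alternatives --config java    # interactively select the Java 8 entry as default
$ java -version                             # confirm that Java 8 is now the default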
Install Scala
Spark is written in Scala, so we need to install Scala to build Spark. Download the latest stable version of Scala.
The
following commands will download and place it in the right directory:
$ wget http://www.scala-lang.org/files/archive/scala-2.10.6.tgz
$ sudo mkdir -p /usr/local/src/scala
$ sudo tar xvf scala-2.10.6.tgz -C /usr/local/src/scala/
Go to the end of your “~/.bashrc” file and add the
following lines:
export SCALA_HOME=/usr/local/src/scala/scala-2.10.6
export PATH=$SCALA_HOME/bin:$PATH
Reload your “~/.bashrc” file:
$ . ~/.bashrc
Check if Scala is installed successfully by running
the following command:
$ scala -version
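If the installation succeeded, you should see output similar to the following (assuming the Scala 2.10.6 download above):
Scala code runner version 2.10.6 -- Copyright 2002-2013, LAMP/EPFL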
After verifying the Scala version and installation, you can proceed to the next step.
Downloading and Installing Apache Spark
Download
the latest version of Spark. After downloading it, you will find the Spark tar
file in the download folder.
The following commands extract the Spark archive and move the Spark software files to their own directory (/usr/local/spark):
$ su -
Password:
# cd /home/Hadoop/Downloads/
# tar xvf spark-1.3.1-bin-hadoop2.6.tgz
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit
Add the following line to the ~/.bashrc file. This adds the location of the Spark binaries to the PATH variable.
export PATH=$PATH:/usr/local/spark/bin
Use the following command to source the ~/.bashrc file:
$ source ~/.bashrc
Verifying the Spark Installation
Start the Spark shell by running the following command:
$ spark-shell
If Spark is installed successfully, the following output will show up, validating the installation:
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>
You can now run all your Spark commands from this shell to do all the magic!
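For example, here is a quick sanity check (a minimal sketch; sc is the SparkContext the shell creates for you, and the res number in your output may differ):
scala> val data = sc.parallelize(1 to 100)   // distribute the numbers 1 to 100 as an RDD
scala> data.reduce(_ + _)                    // sum the elements in parallel
res0: Int = 5050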