I have used Ubuntu 14.04 LTS for this tutorial. However, the following steps should also work on newer versions of Ubuntu and on other Debian-based Linux distros.
Before we begin with Spark, we need to install a few dependencies.
The following commands will install Java 8 on your system. You can skip this step if you already have Java 8 installed. If you have an older version of Java installed, it is recommended to upgrade to Java 8.
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer
The WebUpd8 PPA also provides a package that sets the corresponding environment variables:
$ sudo apt-get install oracle-java8-set-default
To verify that Java 8 was installed successfully, run the following command:
$ java -version
and the output should be:
java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
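If you want to script this check, a minimal sketch could look like the following (the version line is hard-coded to the sample output above just for illustration):

```shell
# Sample version line; in a real script you would capture it with:
#   version_line=$(java -version 2>&1 | head -n 1)
version_line='java version "1.8.0_66"'

# Java 8 reports itself as version 1.8.x
case "$version_line" in
  *'"1.8'*) echo "Java 8 detected" ;;
  *)        echo "Java 8 not found" ;;
esac
```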
You might also want to install Scala, which is the generally preferred language for Spark programming.
Download Scala from here and extract the files to some location, for example /usr/local/scala/. Alternatively, you can run the following commands to achieve the same:
$ wget http://www.scala-lang.org/files/archive/scala-2.10.6.tgz
$ sudo mkdir /usr/local/scala
$ sudo tar xvf scala-2.10.6.tgz -C /usr/local/scala/
Now, to make scala reachable from any location on your file system, we need to set/modify some environment variables.
Go to your home folder using this command: $ cd ~
And open the .bashrc file in your favorite editor: $ vi .bashrc
Append the following lines at the end of the file:
export SCALA_HOME=/usr/local/scala/scala-2.10.6
export PATH=$PATH:$SCALA_HOME/bin
Source the modified .bashrc file with this command in order to make the changes effective:
$ source .bashrc
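To see what these exports actually do, here is a minimal sketch (the install path is the one used above; adjust it if you extracted Scala elsewhere):

```shell
# The .bashrc lines simply append the Scala bin directory to PATH
export SCALA_HOME=/usr/local/scala/scala-2.10.6
export PATH=$PATH:$SCALA_HOME/bin

# Any shell that sources this can now find the scala binaries by name
echo "$PATH" | grep -q "/usr/local/scala/scala-2.10.6/bin" \
  && echo "scala bin dir is on PATH"
```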
To verify that Scala was installed successfully, run this command:
$ scala -version
It should return the following output:
Scala code runner version 2.10.6 -- Copyright 2002-2013, LAMP/EPFL
Note: We have used the 2.10 version of Scala because using the latest stable Scala version (2.11) requires manually building Spark from its source, which is quite time-consuming. Moreover, Spark does not yet support its JDBC component for Scala 2.11. Reference: http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211
So, in case your requirements force you to use Scala 2.11, you can download the Spark source and build it by following the instructions given at this link.
Now we are set to install Spark.
Download Spark from this page: http://spark.apache.org/downloads.html
From the package type drop-down, select a pre-built package matching your Hadoop version. Also, as mentioned in the note above, you always have the option to download the source from the same link and build Spark tailored to your needs.
Once the download is complete, extract the package to some appropriate location.
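Optionally, you can make the Spark scripts reachable from anywhere by setting SPARK_HOME in .bashrc, mirroring the Scala setup above. The path below is just an example; substitute the directory you actually extracted your package into:

```shell
# Example only: adjust to the directory you extracted Spark into
export SPARK_HOME=/usr/local/spark/spark-1.5.2-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin

echo "SPARK_HOME set to $SPARK_HOME"
```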
We are all set. Let's test Spark with an example script. Go to the bin directory under the extracted package and run this command from the terminal:
$ ./run-example SparkPi 10
You should get output similar to:
Pi is roughly 3.14634
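For the curious: SparkPi estimates Pi by Monte Carlo sampling, i.e. by throwing random points at a square and counting how many land inside the inscribed circle; Spark merely spreads those samples across 10 partitions. The same idea in plain awk (no cluster needed) looks like this:

```shell
# Monte Carlo estimate of Pi, the same computation SparkPi distributes.
# Sample n random points in [-1,1]x[-1,1]; the fraction landing inside
# the unit circle approaches Pi/4.
estimate=$(awk 'BEGIN {
  srand(42)                      # fixed seed, just for repeatability
  n = 200000; inside = 0
  for (i = 0; i < n; i++) {
    x = 2 * rand() - 1
    y = 2 * rand() - 1
    if (x * x + y * y <= 1) inside++
  }
  printf "Pi is roughly %f", 4 * inside / n
}')
echo "$estimate"
```

This should print a value close to 3.14; with only 200,000 samples the estimate will be a little off, which is also why the SparkPi output above is not exact.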
Bingo!!! The next step to get started with Spark is here: http://spark.apache.org/docs/1.1.1/quick-start.html
Queries, doubts, suggestions? Comments are welcome... :D