Assuming that Hadoop is already installed across all the machines:
- Download and install Scala, either from a tarball or with the package manager
sudo apt-get install scala
- Check the installed Scala version
scala -version
- Update the hosts file at /etc/hosts on every machine, adding each node's IP address and hostname
sudo vi /etc/hosts
120.120.120.114 bigdata-node-1-1
120.120.120.113 bigdata-node-1-2
120.120.120.112 bigdata-node-1-3
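The host entries above can also be appended in one step; a minimal sketch, using a temporary file as a stand-in for /etc/hosts (on a real node, append with sudo instead):

```shell
# Stand-in for /etc/hosts; on a real node, append to /etc/hosts with sudo.
HOSTS_FILE=$(mktemp)

# One line per cluster node: IP address followed by hostname.
cat >> "$HOSTS_FILE" <<'EOF'
120.120.120.114 bigdata-node-1-1
120.120.120.113 bigdata-node-1-2
120.120.120.112 bigdata-node-1-3
EOF

# Count the entries to confirm all three nodes were added.
grep -c 'bigdata-node' "$HOSTS_FILE"
```

After updating the real /etc/hosts, `ping bigdata-node-1-2` from the master is a quick way to confirm name resolution works.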
- Download Apache Spark, untar it, and move it to an installation location such as /opt
tar xzf spark-1.6.0-bin-hadoop2.6.tgz
sudo mv spark-1.6.0-bin-hadoop2.6/ /opt/
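The download and install steps can be combined into a short script; a hedged sketch — the archive.apache.org URL is an assumption (any Apache mirror works), and DRY_RUN=echo only prints the commands instead of running them:

```shell
# DRY_RUN=echo prints each command instead of executing it; unset it to run for real.
DRY_RUN=echo

# Assumed download URL for the Spark 1.6.0 / Hadoop 2.6 binary package.
SPARK_TGZ=spark-1.6.0-bin-hadoop2.6.tgz
$DRY_RUN wget "https://archive.apache.org/dist/spark/spark-1.6.0/$SPARK_TGZ"
$DRY_RUN tar xzf "$SPARK_TGZ"
$DRY_RUN sudo mv spark-1.6.0-bin-hadoop2.6/ /opt/
```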
- Update the configuration files under /opt/spark-1.6.0-bin-hadoop2.6/conf/
Add the worker (slave) nodes' IP addresses or hostnames to the slaves file
vi slaves
bigdata-node-1-2
bigdata-node-1-3
sudo cp spark-env.sh.template spark-env.sh
Add the lines below at the end of the spark-env.sh file
vi spark-env.sh
SPARK_JAVA_OPTS=-Dspark.driver.port=53411
HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
SPARK_MASTER_IP=bigdata-node-1-1
sudo cp spark-defaults.conf.template spark-defaults.conf
Add the lines below to the spark-defaults.conf file
vi spark-defaults.conf
spark.master spark://bigdata-node-1-1:7077
spark.serializer org.apache.spark.serializer.KryoSerializer
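The three conf edits above can be scripted instead of done by hand in vi; a minimal sketch that writes the files into a temporary directory as a stand-in for /opt/spark-1.6.0-bin-hadoop2.6/conf/:

```shell
# Stand-in for /opt/spark-1.6.0-bin-hadoop2.6/conf/
CONF_DIR=$(mktemp -d)

# slaves: one worker hostname per line.
printf '%s\n' bigdata-node-1-2 bigdata-node-1-3 > "$CONF_DIR/slaves"

# spark-env.sh: environment settings appended at the end of the file.
cat >> "$CONF_DIR/spark-env.sh" <<'EOF'
SPARK_JAVA_OPTS=-Dspark.driver.port=53411
HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
SPARK_MASTER_IP=bigdata-node-1-1
EOF

# spark-defaults.conf: default properties picked up by spark-submit.
cat >> "$CONF_DIR/spark-defaults.conf" <<'EOF'
spark.master            spark://bigdata-node-1-1:7077
spark.serializer        org.apache.spark.serializer.KryoSerializer
EOF

wc -l < "$CONF_DIR/slaves"   # expect 2 worker entries
```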
- Perform the above steps on each system
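Rather than repeating every step by hand, the finished Spark directory can be pushed from the master to the workers; a sketch where ECHO=echo makes it a dry run that only prints the scp commands (the workers must allow writing to /opt/):

```shell
# ECHO=echo turns this into a dry run; remove it to actually copy.
ECHO=echo
for host in bigdata-node-1-2 bigdata-node-1-3; do
  # scp -r copies the whole Spark directory to the same path on each worker.
  $ECHO scp -r /opt/spark-1.6.0-bin-hadoop2.6 "$host:/opt/"
done
```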
- Generate a public key on the master machine using ssh-keygen
ssh-keygen -t rsa
Press Enter at the prompts to accept the defaults
Append the public key to the authorized_keys file
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Copy the master's public key to ~/.ssh/authorized_keys on every slave machine
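Distributing the master's key to all slaves is easiest with ssh-copy-id, which appends ~/.ssh/id_rsa.pub to the remote authorized_keys for you; a dry-run sketch (ECHO=echo only prints the commands, and the hostnames come from the /etc/hosts entries above):

```shell
# ECHO=echo makes this a dry run; remove it to actually copy the key.
ECHO=echo
for host in bigdata-node-1-2 bigdata-node-1-3; do
  # ssh-copy-id appends the local public key to the remote ~/.ssh/authorized_keys.
  $ECHO ssh-copy-id "$host"
done
```

Afterwards, `ssh bigdata-node-1-2` from the master should log in without a password prompt.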
- Add the Spark environment variables to the ~/.bashrc file (no sudo needed for your own .bashrc)
vi ~/.bashrc
export SPARK_HOME=/opt/spark-1.6.0-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin
export PATH=$PATH:$SPARK_HOME/sbin
Save and exit
source ~/.bashrc
exec bash
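A quick sanity check that the PATH update took effect; a minimal sketch that only inspects the PATH string (on a real node, `spark-submit --version` should then resolve as well):

```shell
export SPARK_HOME=/opt/spark-1.6.0-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

# Look for the Spark bin directory inside the colon-separated PATH.
case ":$PATH:" in
  *":$SPARK_HOME/bin:"*) echo "SPARK_HOME/bin is on PATH" ;;
  *)                     echo "SPARK_HOME/bin is missing from PATH" ;;
esac
```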
- Start all the daemons using start-all.sh, or start-master.sh and start-slaves.sh
- To access the Spark web UI, open http://<ip-address_of_master>:8080/ in a browser
- To stop the daemons, use stop-all.sh, or stop-master.sh and stop-slaves.sh
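Once the daemons are up, the cluster can be exercised with the bundled SparkPi example; a dry-run sketch (ECHO=echo only prints the command, and the examples jar path under lib/ is an assumption about the 1.6.0 binary package layout — confirm it on your install):

```shell
ECHO=echo   # dry run; remove to actually submit the job
SPARK_HOME=/opt/spark-1.6.0-bin-hadoop2.6

# Submit the SparkPi example to the standalone master with 10 partitions.
$ECHO "$SPARK_HOME/bin/spark-submit" \
  --class org.apache.spark.examples.SparkPi \
  --master spark://bigdata-node-1-1:7077 \
  "$SPARK_HOME/lib/spark-examples-1.6.0-hadoop2.6.0.jar" 10
```

A successful run prints an approximation of Pi in the driver output, and the completed application appears in the web UI on port 8080.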