September 21, 2014

Setting up a Single-node Hadoop 2.4.1 Cluster on Ubuntu 14.04 64-bit

This post describes how to set up a single-node Hadoop cluster.

  1. Installing Java ( OpenJDK ) and Open SSH Server
    • Check if Java is already installed by typing  java -version  in the terminal
    • If OpenJDK 1.6 or later is already installed, you can skip this step. Otherwise install it from the terminal.
      • NOTE: If the package is not found, open the Software Center, go to the Edit tab, choose Software Sources and enable all repositories.
      • sudo apt-get install openjdk-7-jdk
    • Check again using java -version in the terminal. If other Java versions are installed, remember to pick the right one with update-alternatives ( see the example after this step ).
    • Install the OpenSSH server using sudo apt-get install openssh-server
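    • If several Java versions are installed, the standard Ubuntu alternatives system lets you check and pick the default one ( a small sketch; the exact versions listed will depend on your machine ):
      # list the installed Java alternatives and choose the default interactively
      sudo update-alternatives --config java
      # confirm which version is now the default
      java -version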
  2. Adding a new user.
    • sudo addgroup hadoop    ( hadoop is a group)
    • sudo adduser -ingroup hadoop hduser  ( hduser is the new user )
    • Add hduser to sudo group to have all rights -  sudo adduser hduser sudo
    • Press Enter to leave the optional fields ( full name, room number, etc. ) blank when prompted for the new user's details; see the quick check after this list.
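    • As a quick sanity check ( not part of the original steps ), confirm that the new user really is in the hadoop and sudo groups:
      # should list both 'hadoop' and 'sudo' among hduser's groups
      groups hduser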
  3. Configure SSH access.
    • Install the SSH server ( already done in step 1; skip if so ) -   sudo apt-get install openssh-server
    • Switch to hduser in terminal using su - hduser
    • Type ssh-keygen -t rsa -P ""   ( just press Enter when asked where to save the key )
    • To enable password-less SSH access to localhost, append the public key to the authorized_keys file using this command  ( use id_dsa.pub if id_rsa.pub doesn't exist ):
      cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
    • Test that it works by typing ssh hduser@localhost in the terminal  ( accept the host fingerprint the first time; it should log in without asking for hduser's password. If it does ask, see the permissions fix below. )
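    • If ssh still asks for a password, the usual cause is loose permissions on the .ssh folder; this common fix ( an extra step, not in the original post ) tightens them:
      chmod 700 $HOME/.ssh
      chmod 600 $HOME/.ssh/authorized_keys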
  4. Disabling IPv6
    • Open config file: sudo gedit /etc/sysctl.conf
    • Add the following lines at the end of the file: 
    • # disable ipv6 
      net.ipv6.conf.all.disable_ipv6 = 1 
      net.ipv6.conf.default.disable_ipv6 = 1 
      net.ipv6.conf.lo.disable_ipv6 = 1
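    • The settings take effect after a reboot, or immediately after reloading sysctl; you can then check that IPv6 is really disabled ( a value of 1 means disabled ):
      sudo sysctl -p
      cat /proc/sys/net/ipv6/conf/all/disable_ipv6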
  5. Hadoop Installation
    • Download the Hadoop tar file from an Apache mirror, for example - http://apache.mirrors.pair.com/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
    • Extract it, rename the folder to hadoop.
    • Change directory in Terminal to the parent folder where you have extracted hadoop.
      For example -  cd /home/kishorer747/Downloads/  
    • Command to move -  sudo mv hadoop /usr/local/ 
    • Now change the ownership of the folder so that hduser has full permissions, using  sudo chown -R hduser:hadoop /usr/local/hadoop   ( the whole step is summarized as terminal commands below )
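    • For reference, a rough sketch of this whole step as terminal commands ( the Downloads folder is only an example location ):
      cd ~/Downloads
      wget http://apache.mirrors.pair.com/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
      tar -xzf hadoop-2.4.1.tar.gz
      mv hadoop-2.4.1 hadoop
      sudo mv hadoop /usr/local/
      sudo chown -R hduser:hadoop /usr/local/hadoop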
  6. Setting Global Variables.
    • Add the variables to the .bashrc, .profile and /etc/environment files, and do this for both users: paste the block below once as your normal user and once again as hduser ( /etc/environment is system-wide, so it only needs the block once ). 
    • Open these files and add the following lines at the end. NOTE: update the JAVA_HOME variable to match your Java version and install path.   
    • After adding the code, open a new terminal ( or run source ~/.bashrc ) and test with  echo $HADOOP_HOME  and  echo $JAVA_HOME  ( each should print the corresponding path; see the check after this step )

    • sudo gedit ~/.bashrc

      sudo gedit ~/.profile 

      sudo gedit /etc/environment
    • # Set Hadoop-related environment variables
      export HADOOP_PREFIX=/usr/local/hadoop
      export HADOOP_HOME=/usr/local/hadoop
      export HADOOP_MAPRED_HOME=${HADOOP_HOME}
      export HADOOP_COMMON_HOME=${HADOOP_HOME}
      export HADOOP_HDFS_HOME=${HADOOP_HOME}
      export YARN_HOME=${HADOOP_HOME}
      export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
      # Native Path
      export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
      export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"
      # Java path - adjust to your installation; for the apt OpenJDK 7 package
      # from step 1 this is typically /usr/lib/jvm/java-7-openjdk-amd64
      export JAVA_HOME='/usr/local/Java/jdk1.7.0_65'
      # Add the Hadoop bin/ and sbin/ directories and the Java bin/ directory to PATH
      export PATH=$PATH:$HADOOP_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/sbin
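    • A quick way to verify the variables ( the printed paths should match what you entered above ):
      source ~/.bashrc
      echo $HADOOP_HOME      # should print /usr/local/hadoop
      echo $JAVA_HOME        # should print your Java installation path
      hadoop version         # confirms the hadoop command is on the PATH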
  7. Configuring the Hadoop configuration files.
    1. Change directory using cd /usr/local/hadoop/etc/hadoop
    2. Open the yarn-site.xml file and replace its <configuration> section with the following lines -
      sudo gedit yarn-site.xml
    
    
    
    <configuration>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
      <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
      </property>
    </configuration>
    • Open the core-site.xml file and replace its <configuration> section with the following lines 
      sudo gedit core-site.xml
    
    
    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/app/hadoop/tmp</value>
        <description>A base for other temporary directories.</description>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:54310</value>
        <description>The name of the default file system.  A URI whose
        scheme and authority determine the FileSystem implementation.  The
        uri's scheme determines the config property (fs.SCHEME.impl) naming
        the FileSystem implementation class.  The uri's authority is used to
        determine the host, port, etc. for a filesystem.</description>
      </property>
    </configuration>
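    • Because core-site.xml points hadoop.tmp.dir at /app/hadoop/tmp, that folder must exist and be owned by hduser before the namenode is formatted, otherwise the format step later on tends to fail with a permission error. Creating and owning it looks like this:
      sudo mkdir -p /app/hadoop/tmp
      sudo chown hduser:hadoop /app/hadoop/tmp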
    
    
  • Open the mapred-site.xml file and replace its <configuration> section with the following lines  ( see the note after the snippet if the file does not exist yet )
    sudo gedit mapred-site.xml


  <configuration>
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:54311</value>
      <description>The host and port that the MapReduce job tracker runs
      at.  If "local", then jobs are run in-process as a single map
      and reduce task.</description>
    </property>
  </configuration>
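  • Note: a fresh Hadoop 2.4.1 extract usually ships only mapred-site.xml.template, not mapred-site.xml itself. If gedit opens an empty file, copy the template first:
    cd /usr/local/hadoop/etc/hadoop
    cp mapred-site.xml.template mapred-site.xml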
  • Open the hdfs-site.xml file and replace its <configuration> section with the following lines 
    sudo gedit hdfs-site.xml


  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>1</value>
      <description>Default block replication.
      The actual number of replications can be specified when the file is created.
      The default is used if replication is not specified in create time.</description>
    </property>
  </configuration>
  8. Create the HDFS directories for the NameNode and DataNode.
    • sudo mkdir -p $HADOOP_HOME/yarn_data/hdfs/namenode
      sudo mkdir -p $HADOOP_HOME/yarn_data/hdfs/datanode
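    • ( Optional ) These two folders are usually referenced from hdfs-site.xml so the NameNode and DataNode know where to keep their data. The extra properties would look roughly like this, using the standard Hadoop 2.x property names and the folders just created:
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop/yarn_data/hdfs/namenode</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop/yarn_data/hdfs/datanode</value>
      </property>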

  9. Testing Time !!  IMPORTANT: Run all these commands as hduser.
    • Own the hadoop folder again -  sudo chown -R hduser:hadoop /usr/local/hadoop
    • Go to the bin folder ( cd /usr/local/hadoop/bin ) and format the namenode  ( this works from anywhere, since the Hadoop bin folder was added to the PATH earlier ) -
      hadoop namenode -format
    • The exit status should be 0. If not, something went wrong earlier: check the output for "permission denied" or "cannot create directory" errors, and if you see them, change the ownership of the offending folder to hduser again.
  • Go to the sbin folder ( cd /usr/local/hadoop/sbin ) and start all daemons.
  • start-all.sh   ( deprecated in Hadoop 2.x; running start-dfs.sh and then start-yarn.sh does the same thing )
  • To check which daemons are running, use this command in the terminal ( as hduser ) -  jps
  • If any daemon doesn't start, start it manually:
  • hadoop-daemon.sh start namenode
    hadoop-daemon.sh start datanode
    yarn-daemon.sh start resourcemanager
    yarn-daemon.sh start nodemanager
    mr-jobhistory-daemon.sh start historyserver
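  • Whether the daemons were started with start-all.sh or one by one, jps ( run as hduser ) should now list roughly the following ( the process IDs will differ on every run ): NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager and Jps, plus JobHistoryServer if the history server was started.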
    

  10. Hadoop Web Interfaces.
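    • Once the daemons are running, the stock Hadoop 2.x web interfaces are reachable in a browser at the default ports ( they change only if overridden in the config files ):
      http://localhost:50070     ( HDFS NameNode web UI )
      http://localhost:8088      ( YARN ResourceManager web UI )
      http://localhost:19888     ( MapReduce JobHistory server, if started )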
