This article gives a step-by-step walkthrough for creating a single-node Hadoop cluster. It is a hands-on tutorial, so even a novice user can follow the steps and create the cluster.
Setup – Ubuntu 12.04 VM
Details of VM
– Virtual Box – 32 bit – Ubuntu 12.04 – RAM – 1 GB – HDD – 40 GB
Details of Java on VM
– OpenJDK 1.6 – IcedTea
We will use the Apache Hadoop 1.0.x line, which was the latest stable release at the time of writing.
This was the mirror suggested to me –
We will select version 1.0.3 in the tar.gz file format.
The complete link location is – http://apache.techartifact.com/mirror/hadoop/common/hadoop-1.0.3/hadoop-1.0.3.tar.gz
We will put it in /usr/local directory.
These are the commands in sequence. It would be cool if they could be put into a script.
As my ‘hadoop’ user was not in the ‘sudoers’ list but the user ‘sumod’ was, I used ‘sumod’ to download the tar.gz file.
sumod@sumod-hadoop:/usr/local$ sudo wget http://apache.techartifact.com/mirror/hadoop/common/hadoop-1.0.3/hadoop-1.0.3.tar.gz
We will untar the archive.
sumod@sumod-hadoop:/usr/local$ sudo tar -zxvf hadoop-1.0.3.tar.gz
We now have the directory hadoop-1.0.3. I will not rename it so that I always know the version number.
Let’s change the ownership of the installation.
sumod@sumod-hadoop:/usr/local$ sudo chown -R hadoop:hadoop hadoop-1.0.3
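The three commands above can be collected into a small script. This is only a sketch: the variable names and the run helper are made up for illustration, and the script defaults to a dry run that just prints each command. It assumes the mirror URL is still reachable and that the invoking user has sudo rights.

```shell
#!/bin/sh
# Sketch only: defaults to a dry run that prints each command.
# Set DRY_RUN=0 to actually execute (needs sudo rights and network access).
HADOOP_VERSION="1.0.3"
HADOOP_TARBALL="hadoop-${HADOOP_VERSION}.tar.gz"
MIRROR_URL="http://apache.techartifact.com/mirror/hadoop/common/hadoop-${HADOOP_VERSION}/${HADOOP_TARBALL}"
INSTALL_DIR="/usr/local"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Download into /usr/local, unpack there, then hand the tree to the hadoop user.
run sudo wget -P "$INSTALL_DIR" "$MIRROR_URL"
run sudo tar -zxvf "$INSTALL_DIR/$HADOOP_TARBALL" -C "$INSTALL_DIR"
run sudo chown -R hadoop:hadoop "$INSTALL_DIR/hadoop-$HADOOP_VERSION"
```

Run it once as is to check the printed commands, then again with DRY_RUN=0 from a user that is in the sudoers list.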
We will now set HADOOP_HOME and JAVA_HOME, and add the Hadoop ‘bin’ directory to the PATH, by editing the .bashrc of the ‘hadoop’ user.
#Add HADOOP_HOME, JAVA_HOME and update PATH
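A minimal sketch of such .bashrc lines follows. The HADOOP_HOME value comes from the install location used above. The JAVA_HOME path is an assumption (a typical OpenJDK 6 location on 32-bit Ubuntu 12.04), so check what is actually under /usr/lib/jvm on your VM.

```shell
# Add HADOOP_HOME, JAVA_HOME and update PATH
export HADOOP_HOME=/usr/local/hadoop-1.0.3
# Assumed OpenJDK 6 path; verify with: ls /usr/lib/jvm
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386
# Put the Hadoop executables (bin subdirectory) on the PATH
export PATH="$PATH:$HADOOP_HOME/bin"
```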
If these changes do not take effect when you switch to the hadoop user or log in over ssh, please add this line to the .bash_profile file in your home directory. If .bash_profile does not exist, create it first.
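One common way to do this, shown here as an assumption and not necessarily the exact line used in this setup, is to have .bash_profile load .bashrc so that login shells pick up the same variables:

```shell
# ~/.bash_profile: load ~/.bashrc for login shells (ssh, su - hadoop) as well
if [ -f "$HOME/.bashrc" ]; then
    . "$HOME/.bashrc"
fi
```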
We need to configure the JAVA_HOME variable for the Hadoop environment as well. The configuration files are usually in the ‘conf’ subdirectory, while the executables are in the ‘bin’ subdirectory.
The important files in ‘conf’ directory are
hadoop-env.sh, hdfs-site.xml, core-site.xml, mapred-site.xml.
hadoop-env.sh – Open the hadoop-env.sh file. It says at the top that Hadoop-specific environment variables are stored here. The only required variable is JAVA_HOME. In this file, the variable is already defined, but the line is commented out. Uncomment the line and set JAVA_HOME to your Java installation. In our case,
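A minimal sketch of the edited line in conf/hadoop-env.sh, assuming OpenJDK 6 sits at the typical 32-bit Ubuntu 12.04 location (the path is an assumption; use whatever directory exists under /usr/lib/jvm on your machine):

```shell
# The java implementation to use. Required.
# Assumed path for OpenJDK 6 on 32-bit Ubuntu 12.04; adjust to your system.
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386
```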
conf/*-site.xml – The earlier hadoop-site.xml file is now replaced with three different settings files – core-site.xml, hdfs-site.xml, mapred-site.xml. The main parameters that you need to refer to or modify in these three files are
core-site.xml – hadoop.tmp.dir, fs.default.name
hdfs-site.xml – dfs.replication
mapred-site.xml – mapred.job.tracker
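To make the four parameters concrete, here is a sketch that writes minimal versions of the three files into a scratch directory. Every value in it is an assumption chosen for a single-node setup (localhost ports 9000 and 9001, replication 1, a hypothetical /home/hadoop/tmp directory), not the settings used in this article. CONF_DIR is also a made-up variable; compare the output with your own conf files before copying anything over.

```shell
#!/bin/sh
# Sketch: write minimal single-node settings into a scratch directory.
# All values are illustrative assumptions, not the article's settings.
CONF_DIR="${CONF_DIR:-./conf-sketch}"
mkdir -p "$CONF_DIR"

# core-site.xml: base for temporary directories and the default file system URI
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

# hdfs-site.xml: one copy of each block is enough on a single node
cat > "$CONF_DIR/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

# mapred-site.xml: host and port of the JobTracker
cat > "$CONF_DIR/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
EOF
```

Setting dfs.replication to 1 fits a single node because there is only one datanode to hold each block.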
Now that we have downloaded, extracted and configured Hadoop, it is time to start it up. The first step is to format the NameNode. This initializes the FSNamesystem in the directory specified by the ‘dfs.name.dir’ variable. It also writes a VERSION file that records the namespace ID of this instance, the ctime and the version. If you format the NameNode, you also have to clean up the datanodes. Note that if you are just adding new datanodes to the cluster, you do not need to format the NameNode.
Format HDFS system via NameNode
I gave the command – $ hadoop namenode -format
I got the warning – $HADOOP_HOME is deprecated. So I am going to make the following change in the hadoop-env.sh file.
If you get any exceptions related to the XML files, please check that you have properly closed the tags.
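A minimal sketch of such a change, assuming it is the HADOOP_HOME_WARN_SUPPRESS variable that the Hadoop 1.x startup scripts check before printing this warning:

```shell
# In conf/hadoop-env.sh: silence the "$HADOOP_HOME is deprecated" warning.
# Any non-empty value works; TRUE is used here for readability.
export HADOOP_HOME_WARN_SUPPRESS="TRUE"
```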
Start your cluster
If everything has gone well so far, start the single-node cluster.
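As a sketch, the cluster can be started with the start-all.sh script from the Hadoop ‘bin’ directory and checked with jps, the JDK tool that lists running Java processes. The daemon list and the messages below are illustrative, and the snippet assumes that both start-all.sh and jps are on the PATH of the ‘hadoop’ user.

```shell
#!/bin/sh
# Daemons expected on a single-node Hadoop 1.x cluster
EXPECTED_DAEMONS="NameNode SecondaryNameNode DataNode JobTracker TaskTracker"

if command -v start-all.sh >/dev/null 2>&1; then
    start-all.sh
    # jps prints one "PID Name" line per running Java process
    running=$(jps)
    for daemon in $EXPECTED_DAEMONS; do
        # The leading space stops NameNode from matching SecondaryNameNode
        case "$running" in
            *" $daemon"*) echo "$daemon is running" ;;
            *) echo "$daemon is NOT running" ;;
        esac
    done
else
    echo "start-all.sh not found; is \$HADOOP_HOME/bin on the PATH?" >&2
fi
```

The matching stop-all.sh script in the same directory shuts the daemons down again.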
Using the NameNode web interface, we can browse the Hadoop file system and logs. This is the HDFS layer of the system. Using the JobTracker web interface, we can see the job history. Using the TaskTracker web interface, we can view the log files. JobTracker and TaskTracker belong to the MapReduce layer of the system. We can also view the number of Map and Reduce tasks scheduled. Using the NameNode, we can view the input and output files and the status of the nodes. In my setup the default block size is 64 MB, which is the default for Hadoop 1.x. Many Hadoop setups raise the block size to 128 MB.
Well, that is pretty much all there is to setting up a single-node Hadoop cluster on Ubuntu.
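For reference, the snippet below prints the addresses of the three web interfaces. It assumes the Hadoop 1.x default ports (50070, 50030, 50060) have not been changed in the site files and that you are browsing from the VM itself, hence localhost. The variable names are made up for illustration.

```shell
#!/bin/sh
# Default web UI addresses in Hadoop 1.x (assumes unchanged ports, local browser)
NAMENODE_UI="http://localhost:50070/"     # HDFS: file system browser, node status, logs
JOBTRACKER_UI="http://localhost:50030/"   # MapReduce: job history, scheduled tasks
TASKTRACKER_UI="http://localhost:50060/"  # MapReduce: task logs

for url in "$NAMENODE_UI" "$JOBTRACKER_UI" "$TASKTRACKER_UI"; do
    echo "$url"
done
```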