Distributed Hadoop and HBase installation - Fedora Linux

In this post I will describe how to get started with the latest versions of Hadoop and HBase, covering all the steps needed to obtain a working Hadoop installation. The steps described here can also be used to set up a working installation on a large cluster (even though that may require additional steps, such as a shared filesystem).

Prerequisites

Hadoop and HBase use SSH to manage their daemons, so we first need a working OpenSSH installation:

 sudo dnf install openssh openssh-askpass openssh-clients openssh-server 

Don't forget to start the ssh service using the following command:

 sudo service sshd start 
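
If you also want sshd to start automatically at boot (optional; Fedora uses systemd), you can enable the service as well:

    #start sshd automatically at every boot
    sudo systemctl enable sshd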

Add a dedicated Hadoop/HBase user (optional but recommended)

 
    #create group hadoop
    sudo groupadd hadoop
    #create user hadoop_user (the -g flag adds the new user to the specified group)
    sudo useradd -g hadoop hadoop_user
    passwd hadoop_user
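
As a quick sanity check you can verify that the user and group were created as expected:

    #show the uid, gid and group membership of the new user
    id hadoop_user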

hadoop_user SSH passwordless login

Hadoop and HBase require passwordless SSH login. Configuring it is fairly easy: we simply need to generate RSA keys for the user we want to grant login to, and then copy the public key into the authorized_keys file on the login machine.

    #login as the newly created user
    su - hadoop_user
    #create .ssh folder
    mkdir -p ~/.ssh
    #generate keys for hadoop_user (leave default if asked for some input)
    ssh-keygen -t rsa -P ""
    #install keys (in a cluster setting, do this step for each cluster node) - (on some systems the file is authorized_keys2)
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys 
    #change permissions for ssh keys folder and authorized keys file
    chmod 700 ~/.ssh
    chmod 640 ~/.ssh/authorized_keys
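
You can now verify that passwordless login actually works (the very first connection may still ask you to accept the host key):

    #should log you in without asking for a password
    ssh localhost
    exit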

Hadoop configuration files setup

hdfs-site.xml

Create the datanode, namenode and tmp directories and set the correct permissions on them.

 
    mkdir -p $HADOOP_WORKING_DIR/datanode_dir $HADOOP_WORKING_DIR/namenode_dir $HADOOP_WORKING_DIR/tmp
    #change permissions on the datanode_dir and namenode_dir
    chmod -R 750 $HADOOP_WORKING_DIR/namenode_dir
    chmod -R 750 $HADOOP_WORKING_DIR/datanode_dir

We can now provide these newly created directory paths to the hdfs-site.xml configuration file. The following is a basic hdfs-site.xml you can use as a template (changing the directory paths accordingly).
$HADOOP_WORKING_DIR in my case is /mnt/DATA/VELaSSCO/hadoop-2.6.4/.

 
<configuration>

        <property>
            <name>dfs.namenode.name.dir</name>
            <value>/mnt/DATA/VELaSSCO/hadoop-2.6.4/hadoop_dirs/namenode_dir</value>
            <description>Determines where on the local filesystem the DFS name node
              should store the name table.  If this is a comma-delimited list
              of directories then the name table is replicated in all of the
              directories, for redundancy. </description>
            <final>true</final>
          </property>
          <property>
            <name>dfs.datanode.data.dir</name>
            <value>/mnt/DATA/VELaSSCO/hadoop-2.6.4/hadoop_dirs/datanode_dir</value>
            <description>Determines where on the local filesystem an DFS data node
               should store its blocks.  If this is a comma-delimited
               list of directories, then data will be stored in all named
               directories, typically on different devices.
               Directories that do not exist are ignored.
            </description>
            <final>true</final>
          </property>
          <property>
              <name>dfs.webhdfs.enabled</name>
              <value>true</value>
          </property>
          <property>
            <name>dfs.domain.socket.path</name>
            <value>/mnt/DATA/VELaSSCO/hadoop-2.6.4/hadoop_dirs/hadoop-hdfs/dn_socket</value>
          </property>
        <property>
          <name>dfs.replication</name>
          <value>1</value>
          <description>Default block replication.
          The actual number of replications can be specified when the file is created.
          The default is used if replication is not specified in create time.
          </description>
        </property>
</configuration>
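
Note that the dfs.domain.socket.path entry above is only used when HDFS short-circuit local reads are enabled; if you keep it, make sure the directory that will hold the socket exists. A minimal sketch with the paths from my setup (adjust them to your own layout):

    #create the directory that will contain the datanode domain socket
    mkdir -p /mnt/DATA/VELaSSCO/hadoop-2.6.4/hadoop_dirs/hadoop-hdfs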

core-site.xml

In this file we specify the location of the Hadoop tmp folder (created earlier) and the URI of the HDFS filesystem.
The following is an example of a basic (yet working) core-site.xml file.

 
<configuration>
   <property>
        <name>hadoop.tmp.dir</name>
        <value>/mnt/DATA/VELaSSCO/hadoop_dirs/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:54310</value>
    </property>
</configuration>
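
Once the Hadoop binaries are on your $PATH (configured in a later step), you can double check that this configuration is picked up, for example by querying the HDFS URI we just set:

    #should print hdfs://localhost:54310
    hdfs getconf -confKey fs.defaultFS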

mapred-site.xml

mapred-site.xml tells Hadoop how to execute MapReduce jobs: which MapReduce management framework to use, plus other useful settings such as the maximum number of map or reduce tasks per node.
Note that by default the mapred-site.xml file does not exist in the etc/hadoop/ folder, but it can easily be created from the template in the same folder.

 
    #create the mapred-site.xml file    
    cp mapred-site.xml.template mapred-site.xml

The following is a mapred-site.xml file with basic settings:

 
<configuration>
        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
            </property>
        <property>
                <name>mapreduce.jobtracker.http.address</name>
                <value>localhost:50030</value>
        </property>
        <property>
                <name>mapreduce.tasktracker.http.address</name>
                <value>localhost:50060</value>
        </property>
        <property>
                <name>mapreduce.jobhistory.address</name>
                <value>localhost:10020</value>
        </property>
        <property>
                <name>mapreduce.tasktracker.map.tasks.maximum</name>
                <value>8</value>
        </property>
        <property>
                <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
                 <value>2</value>
        </property>
</configuration>

These settings should be self-explanatory.

$JAVA_HOME and $PATH settings in ~/.bashrc

The $JAVA_HOME variable should point to a working Java 1.7+ installation.
You can also add the Hadoop binaries to the $PATH variable, so that all the Hadoop tools are easier to use.

Use the following as a template changing paths accordingly.

 
export JAVA_HOME=/usr/java/latest/
export HADOOP_HOME=/mnt/DATA/VELaSSCO/hadoop-2.6.4
export PATH=$PATH:$HADOOP_HOME/bin
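
After editing ~/.bashrc, reload it and check that the Hadoop binaries are found. On a brand-new installation you will typically also need to format HDFS once before starting the daemons for the first time (formatting erases the contents of namenode_dir, so only do it on a fresh install):

    #reload the environment and check that hadoop is on the PATH
    source ~/.bashrc
    hadoop version
    #format HDFS (only once, on a fresh installation)
    hdfs namenode -format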

Testing the Hadoop Installation

It's time to test our installation of Hadoop:

 
    #start all hadoop services
    cd $HADOOP_HOME/sbin
    chmod u+x start-all.sh
    ./start-all.sh

    #use jps to check that everything is running
    jps

jps should output something like

 
2339 DataNode
2900 SecondaryNameNode
15752 Jps
6104 Main
3704 NodeManager
3356 ResourceManager
1981 NameNode

You can also check whether the Hadoop installation was successful by looking at the listening ports; the ones we specified in the configuration files should appear:

 
$ netstat -plten  | grep java

tcp        0      0 0.0.0.0:50010           0.0.0.0:*               LISTEN      1001       2290864    2339/java           
tcp        0      0 0.0.0.0:50075           0.0.0.0:*               LISTEN      1001       2284994    2339/java           
tcp        0      0 0.0.0.0:50020           0.0.0.0:*               LISTEN      1001       2285002    2339/java           
tcp        0      0 127.0.0.1:54310         0.0.0.0:*               LISTEN      1001       2290884    1981/java           
tcp        0      0 0.0.0.0:50090           0.0.0.0:*               LISTEN      1001       2285187    2900/java           
tcp        0      0 0.0.0.0:50070           0.0.0.0:*               LISTEN      1001       2286055    1981/java           
....
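
As a final smoke test you can try a few basic HDFS operations (the paths below are just an example):

    #create a directory in HDFS, upload a local file and list it
    hdfs dfs -mkdir -p /user/hadoop_user
    hdfs dfs -put /etc/hosts /user/hadoop_user/
    hdfs dfs -ls /user/hadoop_user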

This concludes the Hadoop installation. Future posts will address the execution of examples and HBase/Hadoop interaction.
