Configuration of Hadoop By Ansible

Aman Rathi
3 min read · Dec 12, 2020


Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. Hadoop consists of a namenode and a large number of datanodes that handle all the data storage, and configuring each datanode manually is a time-consuming task. So we use a configuration management tool to configure the nodes; here we are using Ansible.

We will create a role in Ansible that dynamically fetches the IPs of the nodes and configures them automatically. Before moving further, let's have a look at the road map of the role.

Road Map of Role

  1. Download the Hadoop and JDK software
  2. Install the software
  3. Copy the modified configuration files
  4. Create a directory for data storage
  5. Format the namenode (once)
  6. Start the services
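
The steps above map onto a standard Ansible role layout. A sketch of what the role's directory tree might look like (assuming the usual `ansible-galaxy init` structure and the role name `ansible-hadoop` used in the playbook at the end; the exact file names are assumptions):

```text
ansible-hadoop/
├── tasks/
│   └── main.yml            # the six tasks described below
└── templates/
    ├── namenode/
    │   ├── core-site.xml   # Jinja2 templates, one set per node type
    │   └── hdfs-site.xml
    └── datanode/
        ├── core-site.xml
        └── hdfs-site.xml
```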

Let's move through it step by step.

Step 1. Downloading the software

First we download the software onto our nodes from publicly available locations. Here I am downloading the JDK from my Amazon S3 bucket and Hadoop from the Apache archive.

- name: downloading the files
  get_url:
    url: "{{ item }}"
    dest: /home/ec2-user/
  loop:
    - https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1-1.x86_64.rpm
    - https://lalitbucket6033.s3.ap-south-1.amazonaws.com/jdk-8u171-linux-x64.rpm

Step 2. Installing the software

After downloading, we install the software on both the datanodes and the namenode.

- name: installing the software
  command:
    cmd: "{{ item }}"
    warn: no
  loop:
    - rpm -ivh /home/ec2-user/jdk-8u171-linux-x64.rpm --force
    - rpm -ivh /home/ec2-user/hadoop-1.2.1-1.x86_64.rpm --force

Step 3. Copying the edited files

This task copies the edited configuration files, with separate versions for the namenode and the datanodes. You can refer to the GitHub link below for these files, where I used Jinja2 templating to modify them.

- name: copying the files
  template:
    src: ../templates/{{ node_name }}/{{ item }}
    dest: /etc/hadoop/{{ item }}
  loop:
    - core-site.xml
    - hdfs-site.xml
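
For illustration, a minimal `core-site.xml` template for Hadoop 1.x could look like the sketch below. The variable `master_ip` is hypothetical (not from the original role) and stands in for however the namenode's address is fetched:

```xml
<!-- templates/datanode/core-site.xml — illustrative sketch only -->
<configuration>
    <property>
        <!-- Hadoop 1.x property pointing clients and datanodes at the namenode;
             master_ip is an assumed variable resolved from the inventory -->
        <name>fs.default.name</name>
        <value>hdfs://{{ master_ip }}:9001</value>
    </property>
</configuration>
```

The `hdfs-site.xml` template would similarly parameterize `dfs.name.dir` (on the namenode) or `dfs.data.dir` (on the datanodes) with the `direc_name` variable used later in the playbook.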

Step 4. Creating the directory for data storage

This task creates a directory: on the datanodes it is used to store the data blocks, and on the namenode it holds the metadata that tracks the datanodes' details.

- name: creating the directory
  file:
    path: "{{ direc_name }}"
    state: directory
    mode: '0755'

Step 5. Formatting the namenode

The namenode needs to be formatted the very first time, so that it is ready to store information about the datanodes.

- name: formatting namenode
  shell: "echo Y | hadoop namenode -format"
  args:
    warn: no
  when: node_name == "namenode"

Step 6. Starting the services

This is the final task, which starts the Hadoop daemon on each node.

- name: starting the services
  command: sudo hadoop-daemon.sh start "{{ node_name }}"
  args:
    warn: no
  ignore_errors: yes

Now let's deploy this role by creating a simple playbook. I launched two datanodes with the tag (name = datanode) and one namenode with the tag (name = namenode).

- hosts: tag_name_namenode
  vars:
    direc_name: /home/ec2-user/nn
    node_name: namenode
  tasks:
    - name: configuring namenode
      include_role:
        name: ansible-hadoop

- hosts: tag_name_datanode
  vars:
    direc_name: /home/ec2-user/nn
    node_name: datanode
  tasks:
    - name: configuring datanode
      include_role:
        name: ansible-hadoop
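
The `tag_name_namenode` and `tag_name_datanode` groups come from a dynamic inventory that turns EC2 tags into host groups. The original post does not show that configuration; a sketch using the `amazon.aws.aws_ec2` inventory plugin (the region and file name are assumptions) could look like this:

```yaml
# aws_ec2.yml — assumed dynamic-inventory config, not from the original article
plugin: amazon.aws.aws_ec2
regions:
  - ap-south-1
keyed_groups:
  # builds groups such as tag_name_namenode from each instance's "name" tag
  - key: tags.name
    prefix: tag_name
```

Older setups from this era often used the legacy `ec2.py` inventory script instead, which produces the same `tag_name_*` group naming.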

Output

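The original post showed the result as screenshots. One way to check the cluster from the namenode is Hadoop 1.x's `hadoop dfsadmin -report` command; wrapping it in extra tasks like the following is my assumption, not part of the original role:

```yaml
- name: reporting cluster status
  command: hadoop dfsadmin -report
  register: report
  when: node_name == "namenode"

- name: showing the report
  debug:
    var: report.stdout_lines
  when: node_name == "namenode"
```

If the datanodes registered successfully, the report lists them under "Datanodes available".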
That's all for this article. If you have any queries or suggestions, feel free to comment below.

Thank You!!
