Let's check Hadoop Parallelism
--
Myth About Hadoop Distributed File System
In this article we are going to discuss a common myth about HDFS, the file system of a Hadoop cluster. The myth is that file transfer from the client side to the datanodes is a parallel procedure, i.e. that the blocks are sent in parallel. That is NOT true.
The file is sent from the client directly to the datanodes, and the blocks are transferred one by one. Replication is serial as well: a block goes from the client to the first datanode of the pipeline (which varies from block to block), and that datanode forwards the block to the next datanode, which forwards it to the next. Only after the first block has been transferred completely and all of its replicas have been formed does the second block come into play.
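To make this concrete, here is a simplified sketch of how a two-block file with replication factor 3 gets written (the datanode order in each pipeline varies per block, so the DN labels below are placeholders, not fixed assignments):

Block A: client -> DN -> DN -> DN (wait until all 3 replicas exist)
Block B: client -> DN -> DN -> DN (starts only after Block A is done)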
Hadoop Cluster: A Hadoop cluster consists of a namenode (master node) and a number of datanodes (slave nodes).
IP of datanode 1: 65.0.106.98
IP of datanode 2: 13.232.181.200
IP of datanode 3: 35.154.168.85
IP of client: 192.168.43.243 (private), 42.111.11.217 (public)
Now let's look at the transfer path of the file and its blocks to see whether it is serial or parallel. From the client side, put a file into the HDFS cluster. Using the command below, we put a file named arth.txt of approximately 10MB into the cluster.
hadoop fs -put arth.txt /
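Once the file is in HDFS, its block placement can be cross-checked with fsck (assuming the file landed at /arth.txt, as in the put command above):

hdfs fsck /arth.txt -files -blocks -locations

The output lists every block of the file together with the datanodes holding its replicas, which is a handy sanity check for the tcpdump captures below.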
To set the replica count and the block size we use the hdfs-site.xml file present in the /etc/hadoop/ folder. In our setup the number of replicas is 3 and the block size is 5MB, so the ~10MB arth.txt file is divided into two blocks.
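A minimal sketch of the relevant part of hdfs-site.xml looks like this (dfs.replication and dfs.blocksize are the standard property names; dfs.blocksize takes the size in bytes, so 5MB is 5242880):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>5242880</value>
  </property>
</configuration>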
Now, to transfer the first block, 'Block A', the client picks the first datanode of the pipeline and transfers the block to it; that datanode transfers the block to a second datanode, and the second datanode transfers it to a third datanode for replication.
Block A:
The client transfers this block to the first datanode, with IP 65.0.106.98. We get this information by running this command on the client:
tcpdump -i enp0s3 "port 50010 and (src 13.232.181.200 or src 65.0.106.98 or src 35.154.168.85)" -n > client.txt
This command captures the network packets on port 50010 (the default HDFS datanode data-transfer port) whose source is 13.232.181.200, 65.0.106.98, or 35.154.168.85, and saves the output to a file named 'client.txt'.
[Screenshot: tcpdump capture on the client]
Now datanode1 (65.0.106.98) chooses datanode2 to transfer this block to. As shown in the diagram below, datanode1 is receiving the block from the client and transferring it to datanode2 (13.232.181.200).
NOTE: To capture the network packets on the datanodes, run the following command on every datanode (in place of each {datanode}, put the IPs of the other two datanodes, excluding the datanode on which you are running the command):
tcpdump -i enp0s3 "port 50010 and (src {client_ip} or src {datanode} or src {datanode})" -n > filename.txt
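For example, filled in for datanode1 (65.0.106.98) it would look like the line below; the output name dn1.txt is just an example, and the client's private IP is used on the assumption that this is the address the datanodes see (substitute the public IP 42.111.11.217 if the nodes reach the client over the internet):

tcpdump -i enp0s3 "port 50010 and (src 192.168.43.243 or src 13.232.181.200 or src 35.154.168.85)" -n > dn1.txt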
[Screenshot: tcpdump capture on datanode1]
Now datanode2 chooses datanode3 (35.154.168.85) to transfer this block to. As shown in the diagram below, datanode2 is receiving the block from datanode1 and transferring it to datanode3 (35.154.168.85).
[Screenshot: tcpdump capture on datanode2]
From the diagram below we can see that datanode3 (35.154.168.85) is receiving the block from datanode2 (13.232.181.200).
[Screenshot: tcpdump capture on datanode3]
Block B: After the replicas of Block A have been formed successfully, the client again selects a datanode and the same procedure takes place; this repeats for every block. Block B took this path for its replicas:
Block B went first from the client to dn3 (35.154.168.85), then from dn3 to dn1 (65.0.106.98), and then from dn1 to dn2 (13.232.181.200).
Hence, we have proved this myth wrong: there is no parallel transfer of blocks to the datanodes; each block travels through a serial replication pipeline. For any query related to the article, drop a mail.
Hope you enjoyed reading, and keep learning, keep sharing.