HADOOP:HDFS: Data Integrity

Hi guys,

This post is about data integrity in HDFS.

As we all know, there is always a chance of data being corrupted while it is stored or processed. This is true even for small data, not just big data, because every I/O operation carries a small probability of corrupting the data.

Since Hadoop deals with large amounts of data, the chances of corruption go up as well. Hadoop has its own mechanism for dealing with this, known as the checksum mechanism.

NOTE :- a checksum only detects corrupted data; it does not fix the corrupted data.
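To make the "detects but does not fix" point concrete, here is a minimal sketch. It uses Python's `zlib.crc32` as a stand-in checksum (an assumption for illustration, not necessarily the exact CRC variant your Hadoop version uses), and the `checksum` helper name is my own.

```python
import zlib

def checksum(data: bytes) -> int:
    """Return a 32-bit CRC of the data (zlib's CRC-32, used here as a stand-in)."""
    return zlib.crc32(data) & 0xFFFFFFFF

original = b"hello hdfs"
stored_crc = checksum(original)  # computed once, when the data is first written

# Simulate a single flipped bit during storage or transfer.
corrupted = bytes([original[0] ^ 0x01]) + original[1:]

# The mismatch tells us the data is bad, but not what the good data was.
is_corrupt = checksum(corrupted) != stored_crc
print("corruption detected:", is_corrupt)
```

The checksum can only say "this does not match"; getting the good bytes back needs another copy of the data.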

When we create/write a file (e.g. file1) through Hadoop's local filesystem, the filesystem client also creates a hidden checksum file named .file1.crc in the same directory.

Now there are two cases in which the integrity of data needs to be checked :-
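The idea of a checksum file sitting next to the data file can be sketched roughly as below. This is only an illustration: the on-disk layout here (one 4-byte CRC per chunk, nothing else) is my own simplification and not the real Hadoop .crc format, the 512-byte chunk size is an illustrative value, and `crc_path`, `write_with_checksum` and `verify` are hypothetical helper names.

```python
import os
import struct
import zlib

BYTES_PER_CHECKSUM = 512  # illustrative chunk size for this sketch

def crc_path(path: str) -> str:
    """Hidden sidecar file next to the data file, e.g. file1 -> .file1.crc"""
    directory, name = os.path.split(path)
    return os.path.join(directory, "." + name + ".crc")

def chunk_crcs(data: bytes) -> list:
    """One CRC per fixed-size chunk of the data."""
    return [zlib.crc32(data[i:i + BYTES_PER_CHECKSUM]) & 0xFFFFFFFF
            for i in range(0, len(data), BYTES_PER_CHECKSUM)]

def write_with_checksum(path: str, data: bytes) -> None:
    """Write the data file and its sidecar checksum file."""
    with open(path, "wb") as f:
        f.write(data)
    crcs = chunk_crcs(data)
    with open(crc_path(path), "wb") as f:
        f.write(struct.pack(">%dI" % len(crcs), *crcs))

def verify(path: str) -> bool:
    """Recompute the chunk CRCs and compare them with the stored ones."""
    with open(path, "rb") as f:
        data = f.read()
    with open(crc_path(path), "rb") as f:
        raw = f.read()
    stored = list(struct.unpack(">%dI" % (len(raw) // 4), raw))
    return stored == chunk_crcs(data)
```

Keeping one checksum per chunk instead of one per file means a mismatch points to a small region of the file rather than the whole thing.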

  1. File/Data Read
  2. File/Data Write

 File/Data Read :- 

When a client reads data from a datanode, it computes the checksum of the data it receives and compares it with the checksum already stored on the datanode. If the two are identical, no corruption is detected. If they differ, the client first reports the bad block and the datanode it was reading from to the namenode, and then a ChecksumException is thrown. The namenode marks that block replica as corrupt and does not direct any other client to it again.

Then the namenode schedules a new copy of the block to be replicated from a datanode that holds a non-corrupted replica, so the replication factor is restored.
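The read path described above can be sketched as a toy model. Everything here is hypothetical and in-memory: `ToyNameNode`, `read_replica` and `ChecksumError` are names I made up to mirror the roles in the text, and `zlib.crc32` again stands in for the real checksum. No actual Hadoop API is used.

```python
import zlib

class ChecksumError(Exception):
    """Stand-in for Hadoop's ChecksumException."""

class ToyNameNode:
    """Keeps track of which (block, datanode) replicas were reported as corrupt."""
    def __init__(self):
        self.corrupt_replicas = set()

    def report_bad_block(self, block_id, datanode):
        self.corrupt_replicas.add((block_id, datanode))

def read_replica(namenode, block_id, datanode, data, stored_crc):
    """Client-side check: verify the data, report to the namenode on mismatch."""
    if zlib.crc32(data) & 0xFFFFFFFF != stored_crc:
        # Report first, then fail the read.
        namenode.report_bad_block(block_id, datanode)
        raise ChecksumError("bad checksum for %s on %s" % (block_id, datanode))
    return data

namenode = ToyNameNode()
good = b"block contents"
stored = zlib.crc32(good) & 0xFFFFFFFF

try:
    read_replica(namenode, "blk_1", "dn1", b"block c0ntents", stored)
except ChecksumError as e:
    print("read failed:", e)

print("replicas marked corrupt:", namenode.corrupt_replicas)
```

In this sketch the re-replication step is left out; the set of corrupt replicas is the information the namenode would use to decide which copy to avoid and which to replace.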

 File/Data Write :- 

While writing, the datanodes are responsible for verifying the checksum of the data/block they receive. The client computes a checksum for the data it is about to write and sends both down the pipeline of datanodes. The last datanode in the pipeline (the third, assuming a replication factor of 3) computes the checksum of the data it received and compares it with the checksum the client created at the start. If both are the same, no corruption is reported; otherwise a ChecksumException is thrown.

In this manner the checksum / corruption detection mechanism works in Hadoop.

Hope you guys like it.

Thanks and cheers.
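A rough sketch of the write pipeline check follows. It is a simplified in-memory model, not Hadoop code: `write_through_pipeline` and its `corrupt_before` parameter (used only to simulate corruption in transit) are hypothetical, `ChecksumError` stands in for ChecksumException, and the data is assumed to be non-empty.

```python
import zlib

class ChecksumError(Exception):
    """Stand-in for Hadoop's ChecksumException."""

def crc(data: bytes) -> int:
    return zlib.crc32(data) & 0xFFFFFFFF

def write_through_pipeline(data, pipeline, corrupt_before=None):
    """Pass data along the datanode pipeline; the last datanode verifies it.

    corrupt_before: name of the datanode just before which one bit gets
    flipped, to simulate corruption on the wire (illustration only).
    """
    client_crc = crc(data)  # checksum created by the client at the start
    stored = {}
    for datanode in pipeline:
        if datanode == corrupt_before:
            data = bytes([data[0] ^ 0x01]) + data[1:]
        stored[datanode] = data  # each datanode keeps what it received
    # The last datanode compares what it received with the client's checksum.
    if crc(data) != client_crc:
        raise ChecksumError("checksum mismatch at " + pipeline[-1])
    return stored

replicas = write_through_pipeline(b"payload", ["dn1", "dn2", "dn3"])
print(sorted(replicas))
```

Calling it with `corrupt_before="dn2"` raises `ChecksumError`, because the data that reaches dn3 no longer matches the checksum the client computed.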



3 thoughts on “HADOOP:HDFS: Data Integrity”

  1. How is integrity checked when a file is transferred from the local file system to HDFS?

    What is the guarantee against corruption of files transferred to HDFS?


  2. Thanks for the reply.
    I transferred a file from the local file system to HDFS using the curl command.

    I want to check whether the file transfer was successful or not.

    Usually when we download an ISO file we validate the download using its checksum information. Is there any way to do something like this when we transfer a file from the local file system to HDFS?


