Hi guys,

This blog is about data integrity in HDFS.

As we all know, there is always a chance of data getting corrupted while it is being stored or processed. This is not specific to big data: even a small amount of data can get corrupted, because every I/O operation carries a small probability of corrupting the data it reads or writes.

Since Hadoop deals with very large amounts of data, the chances of corruption increase as well. Hadoop, however, has a built-in mechanism to deal with corrupted data, known as the checksum mechanism.

NOTE :- this mechanism only detects corrupted data, it does not fix the corrupted data itself.

When we create/write a file (e.g. file1) in the Hadoop local filesystem, the filesystem transparently creates another, hidden checksum file alongside it, named .file1.crc.
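You can see this with a minimal Java sketch (assuming hadoop-common on the classpath and a writable /tmp/crc-demo scratch directory, both just placeholders): it writes file1 through the checksummed local filesystem and then lists the directory through the raw, non-checksummed view so the hidden .file1.crc sidecar shows up.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;

public class CrcSidecarDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // LocalFileSystem is a checksummed filesystem: every file it writes
        // gets a hidden .<name>.crc sidecar in the same directory.
        LocalFileSystem fs = FileSystem.getLocal(conf);

        Path dir  = new Path("/tmp/crc-demo");   // scratch directory (assumption)
        Path file = new Path(dir, "file1");
        fs.mkdirs(dir);
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("hello hdfs data integrity\n");
        }

        // List the directory through the raw (non-checksummed) view so the
        // hidden .file1.crc sidecar becomes visible; the checksummed view
        // filters such files out of listings.
        for (FileStatus s : fs.getRawFileSystem().listStatus(dir)) {
            System.out.println(s.getPath().getName());   // prints file1 and .file1.crc
        }
    }
}
```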

Now there are 2 cases in which the integrity of data needs to be checked :-

  1. File/Data Read
  2. File/Data Write

 File/Data Read :- 

When a client reads data from the datanodes, it recomputes checksums for the data it receives and compares them with the checksums already stored on the datanode. If both checksums are identical, no corruption is detected; otherwise a ChecksumException is thrown. Before the exception is thrown, the client reports the bad block to the namenode, which marks that block replica as corrupted so that no other client is directed to it again.

Then the namenode schedules a new copy of the block to be replicated from a datanode that still holds a non-corrupted replica, so the block comes back to its expected replication factor.
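From the client's side nothing special is needed: checksum verification happens under the covers while the stream is read. A small Java sketch of the read path, following the behaviour described above (the path and configuration here are just placeholders):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ChecksumException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadWithVerification {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);       // HDFS if core-site.xml points there
        Path file = new Path("/user/demo/file1");   // example path (assumption)

        try (FSDataInputStream in = fs.open(file)) {
            // Checksums are verified automatically as the stream is consumed;
            // a mismatch surfaces as a ChecksumException.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } catch (ChecksumException e) {
            // As described above, the bad replica has already been reported
            // to the namenode before this exception reaches the caller.
            System.err.println("Corruption detected while reading: " + e.getMessage());
        }

        // To deliberately skip verification (e.g. to salvage a damaged file):
        // fs.setVerifyChecksum(false);
    }
}
```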

 File/Data Write :- 

While writing, the datanodes are responsible for verifying the checksum of the data/block they receive. The client computes a checksum for the data of the new file it is going to write and sends it along with the data through the datanode pipeline. The last datanode in the pipeline (the third one, assuming a replication factor of 3) compares the checksum of the data it receives with the checksum created at the start. If both are the same, no corruption is reported; otherwise a ChecksumException is thrown.
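Again, a plain write already goes through this checksummed pipeline. Here is a minimal Java sketch of the write side (the path is a placeholder, and dfs.bytes-per-checksum is set only to make the checksum chunk size explicit; 512 bytes is already the default):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithChecksums {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Checksums are computed per chunk of data; the chunk size is
        // controlled by dfs.bytes-per-checksum (512 bytes by default).
        conf.set("dfs.bytes-per-checksum", "512");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/file1");   // example path (assumption)

        // The client computes checksums for each chunk and sends them along
        // with the data through the datanode pipeline; the last datanode in
        // the pipeline verifies them, as described above.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("data written with checksums\n");
        }
        System.out.println("wrote " + file);
    }
}
```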

This is how the checksum / corruption detection mechanism works in Hadoop.

Hope you guys like it.

Thanks and cheers.