Hi guys,
This blog post is about data integrity in HDFS.
As we all know, there is always a chance of data getting corrupted during storage or processing. Every I/O operation carries a small probability of corruption, so this applies even to small amounts of data, not just big data.
Since Hadoop deals with large amounts of data, the chances of corruption increase as well. But Hadoop has its own mechanism to deal with corrupted data, known as the checksum mechanism.
NOTE :- this mechanism only detects corrupted data; it does not fix the corrupted data.
When we create/write a file (e.g. file1) in the Hadoop local filesystem, the filesystem creates another, hidden checksum file named .file1.crc in the same directory. By default, a separate checksum is computed for every 512 bytes of data (controlled by the io.bytes.per.checksum property).
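The chunked-checksum idea can be sketched in Python. Note this is only an illustration: Hadoop actually uses CRC-32C, and the `chunk_checksums` helper and the sample data below are made up for this example.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # mirrors Hadoop's io.bytes.per.checksum default


def chunk_checksums(data: bytes) -> list:
    """Compute a CRC32 checksum for every 512-byte chunk of the data."""
    return [
        zlib.crc32(data[i:i + BYTES_PER_CHECKSUM])
        for i in range(0, len(data), BYTES_PER_CHECKSUM)
    ]


# Writing file1 would also persist these checksums into the .file1.crc sidecar.
data = b"some file contents" * 100   # 1800 bytes -> 4 chunks of <= 512 bytes
crcs = chunk_checksums(data)
```

Later reads recompute the same per-chunk checksums and compare them against the stored ones, which is what the two cases below walk through.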
Now, there are 2 cases where the integrity of data needs to be checked :-
- File/Data Read
- File/Data Write
File/Data Read :-
When a client reads data from a datanode, the checksum of the data it receives is recomputed and compared with the checksum already stored on the datanode. If both checksums are identical, no corruption is detected; otherwise a ChecksumException is thrown. Before the failed read returns, the client also reports the corrupt block to the namenode. The namenode marks the block as corrupt so that no other client is directed to it again,
and then schedules re-replication of the block from a datanode that holds a non-corrupted replica.
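The read-side check boils down to "recompute and compare". Here is a minimal sketch; `verify_read` and this `ChecksumException` class are stand-ins for, not the actual, HDFS client code:

```python
import zlib


class ChecksumException(Exception):
    """Raised when the stored and recomputed checksums disagree."""


def verify_read(data: bytes, stored_crc: int) -> bytes:
    # Recompute the checksum of the bytes we just read ...
    actual = zlib.crc32(data)
    # ... and compare it with the checksum stored alongside the block.
    if actual != stored_crc:
        # In HDFS the client would also report the bad block to the namenode here.
        raise ChecksumException("corrupt block detected")
    return data


good = b"hello hdfs"
crc = zlib.crc32(good)
verify_read(good, crc)               # checksums match, data is returned
try:
    verify_read(b"hello hdfX", crc)  # one flipped byte -> corruption detected
except ChecksumException:
    pass
```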
File/Data Write :-
While writing, datanodes are responsible for verifying the checksum of the data/block. A checksum is computed for the new data as the client writes it, and it travels along the datanode pipeline together with the data. The last datanode in the pipeline (the third one, assuming a replication factor of 3) compares the checksum of the data it receives with the checksum computed at the start. If both are the same, no corruption is reported; otherwise a ChecksumException is thrown.
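The write path can be sketched the same way: the checksum is computed once up front, and the last node in a (simulated) pipeline verifies it. The `corrupt_at` parameter is purely an assumption of this toy model, used to inject corruption in transit:

```python
import zlib


class ChecksumException(Exception):
    """Raised when the last datanode detects a checksum mismatch."""


def write_through_pipeline(data: bytes, corrupt_at=None) -> bool:
    """Simulate a 3-datanode write pipeline (replication factor 3)."""
    expected_crc = zlib.crc32(data)      # computed once, at the start of the write
    for node in range(3):                # data is forwarded node -> node -> node
        if node == corrupt_at:
            data = data[:-1] + b"\x00"   # simulate corruption in transit
        if node == 2:
            # Last datanode verifies what it received against the original checksum.
            if zlib.crc32(data) != expected_crc:
                raise ChecksumException("corruption detected while writing")
    return True  # success is acknowledged back up the pipeline
```

A clean write succeeds, while corrupting the data at any hop before the last datanode makes that node throw, so the client learns about the problem before the write is acknowledged.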
This is how the checksum / corruption detection mechanism works in Hadoop.
Hope you guys like it.
Thanks and cheers.