HDFS – Data Movement across clusters

You can move data between HDFS clusters using the distcp command.

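distcp runs as a MapReduce job and copies the data with map tasks. In current Hadoop releases it uses at most 20 of them by default, and the -m option changes that limit.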
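As a minimal sketch of setting the map count, the helper below only prints a distcp command line so that it can be inspected before it is run. The function name distcp_cmd is made up for this illustration, and the hosts and paths are placeholders.

```shell
# Hypothetical helper (not part of Hadoop): prints a distcp command
# instead of running it, so the options can be reviewed first.
#   $1 = maximum number of map tasks, $2 = source URI, $3 = target URI
distcp_cmd() {
  # -m sets the maximum number of simultaneous copies (map tasks)
  echo "hadoop distcp -m $1 $2 $3"
}

# Placeholder hosts and paths; replace them with real values.
distcp_cmd 40 hdfs://SOURCE_SERVER_IP/HDFS_PATH hdfs://TARGET_SERVER_IP/HDFS_PATH
```

More map tasks mean more parallel copies, but also more load on both clusters and on the network between them, so the right value depends on the setup.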

While moving data I ran into a problem: the copy kept failing because of a checksum mismatch. Whenever the checksum of a copied block did not match, the complete data block was discarded.

  • distcp compares the checksums of the source and target files by default.
  • An HDFS file checksum is built from per-block checksums, so it depends on the block size. If the target cluster uses a different block size, the checksums differ even when the data is identical.
  • Used the “-pb” option, which preserves the source block size on the target so that the checksums match.

Because “-pb” leaves the checksum comparison in place, I still have control to ensure that no data is lost or corrupted during transmission.
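To see why a checksum built from per-block checksums changes with the block size, here is a toy model. It is only a sketch: it uses the POSIX cksum tool on tiny "blocks", whereas HDFS combines MD5 and CRC values, and the function name block_checksum is made up for this illustration.

```shell
# Toy model only, not the real HDFS algorithm.
#   $1 = data string, $2 = block size in bytes
block_checksum() {
  tmp=$(mktemp -d) || return 1
  printf '%s' "$1" > "$tmp/data"
  # Cut the data into fixed-size "blocks": blk.aa, blk.ab, ...
  split -b "$2" "$tmp/data" "$tmp/blk."
  # Checksum every block, then checksum the list of block checksums.
  for f in "$tmp"/blk.*; do
    cksum < "$f"
  done | cksum
  rm -rf "$tmp"
}

# Same data, different block sizes: the combined checksums differ.
block_checksum "abcdefgh" 4
block_checksum "abcdefgh" 8
```

The data is identical in both calls, and only the block boundaries change. This is the situation distcp runs into when the two clusters use different block sizes.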

After that I transferred around 40 TB without hitting another failure caused by a checksum mismatch.

hadoop distcp -pb hdfs://SOURCE_SERVER_IP/HDFS_PATH/* hdfs://TARGET_SERVER_IP/HDFS_PATH/