Sqoop – Optimise Import

Importing data with Sqoop is one of the most time-consuming tasks in a Big Data environment.

Sqoop is a powerful yet simple tool for importing data from different RDBMSs into HDFS. When importing, give the following two settings priority to reduce import time:

  1. Number of mappers: Mappers provide parallelism while importing data into HDFS. If you know a key (--split-by) on which the data is evenly distributed, so that each mapper gets an almost equal share of the data, then raising the number of mappers (--num-mappers, or -m) is the setting you will use most often.
    1. Precaution: Each mapper should get an almost equal amount of data to import; otherwise the mapper with the largest load finishes last and increases your overall import time.
  2. Fetch size (--fetch-size): This is another property that can speed up your import. By default it is set to 1000, so Sqoop fetches data from the database in chunks of 1000 records. You can experiment with this value to speed up the import (I usually prefer setting it to 10,000).
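
To make the two settings concrete, here is a minimal Python sketch that assembles a sqoop import command line with an explicit split key, mapper count, and fetch size, and that estimates how uneven a split is. The JDBC URL, table name, target directory, split column, and the helper names build_sqoop_import and skew_ratio are all hypothetical and only for illustration; connection credentials are left out, and the best mapper count and fetch size depend on your database and cluster.

```python
import shlex


def build_sqoop_import(jdbc_url, table, target_dir, split_by,
                       num_mappers=4, fetch_size=10000):
    """Return the argument list for a `sqoop import` run (illustrative helper)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        "--split-by", split_by,             # column the data is split on
        "--num-mappers", str(num_mappers),  # number of parallel map tasks
        "--fetch-size", str(fetch_size),    # rows read from the database per fetch
    ]


def skew_ratio(rows_per_mapper):
    """Largest mapper load divided by the average load; 1.0 means a perfectly even split."""
    average = sum(rows_per_mapper) / len(rows_per_mapper)
    return max(rows_per_mapper) / average


# Hypothetical connection details, for illustration only.
cmd = build_sqoop_import(
    jdbc_url="jdbc:mysql://db.example.com/sales",
    table="orders",
    target_dir="/data/raw/orders",
    split_by="order_id",
)
print(shlex.join(cmd))

# An even split versus a skewed one: in the skewed case the busiest mapper
# carries far more than the average load, so it finishes last and sets the
# overall import time.
print(skew_ratio([250, 250, 250, 250]))
print(skew_ratio([700, 100, 100, 100]))
```

A skew ratio well above 1 suggests that the chosen --split-by column does not distribute the rows evenly, so adding more mappers is unlikely to help until a better split key is found.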