Impala – Optimise query when using to_utc_timestamp() function

From 40 minutes to just 4 minutes

Impala to_utc_timestamp() function is used to convert date/timestamp timezone to UTC. But it works very slow. If you have less data in table even then you can easily notice its slow performance. 

I faced a similar issue and noticed it was taking around 40 minutes alone to complete a query with 600 million records to convert timezone. At the end of the month, these numbers get twice and running time gets extended to more than 80 minutes. It was a simple conversion and should not be taking this much time. As data is growing every month this may increase to an extent that would cause job failures.  I tried many steps and came up with workaround to decrease this time.

Sharing all methods : 

  1. Compute stats – After computing stats I was hoping that it will bring down some time. But computing stats is itself a heavy operation and instead of  optimising the code , it added 15 more minutes in execution.
  2. UDF – I thought about creating UDF and resolve this issue, but I kept this option for last . As managing UDF is itself a challenge when working in enterprise wise application. Also during upgrades testing of UDF add more complexity to application. 
  3. Tried some random queries to convert timezone in it with some formulae. But it did not work. Also, relying on a formula with conversion like this is not a good practice. 
  4. I tried some other alternatives as well, but above 3 approaches gave me some insight in data which made optimisation simpler but tricky.
    1. When I checked stats at column level I noticed , even though we have 600 million records in table , but number of unique records to be converted is just 1 million.
    2. There were two columns on which we were doing conversion ( 1200 million conversions ) . Instead of this I collected all unique values in these two columns ( only 1 million conversions) and applied conversion on this. — This only took 20 seconds. 🙂
    3. Once, we have the above values I applied a join with table and picked the required columns. — This took 4 minutes.

Total ~4 minutes ( down from 40 minutes) — this stats is of Daily run. In the monthly run, this is going to save an hour at a minimum.

So I noticed instead of optimising queries at the system level we should analyse our data first and should try to start optimisation from data level.