From 40 minutes to just 4 minutes
Impala's to_utc_timestamp() function converts a timestamp from a given timezone to UTC, but it is very slow. Even on a table with relatively little data, its poor performance is easy to notice.
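For context, the slow pattern looks roughly like this (the table, column, and timezone names here are just placeholders for illustration):

```sql
-- Naive approach: to_utc_timestamp() is evaluated once per row,
-- so a table with hundreds of millions of rows means hundreds of
-- millions of function calls.
SELECT
  id,
  to_utc_timestamp(event_ts, 'America/New_York') AS event_ts_utc
FROM events;
```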
I ran into exactly this issue: a query converting the timezone of 600 million records was taking around 40 minutes on its own. At the end of the month the record count roughly doubles, pushing the running time past 80 minutes. It was a simple conversion and should not have taken that long, and since the data grows every month, it could eventually grow enough to cause job failures. I tried several approaches and eventually found a workaround that cut the time drastically.
Sharing all the methods I tried:
- Compute stats – I hoped that computing stats would bring the time down, but COMPUTE STATS is itself a heavy operation; instead of speeding up the query, it added 15 more minutes to the execution.
- UDF – I considered writing a UDF to solve the issue, but kept that option for last. Managing UDFs is a challenge in an enterprise-wide application, and re-testing them during upgrades adds further complexity.
- I tried a few ad-hoc queries that converted the timezone with hand-rolled offset formulae, but they did not work. Relying on a formula for a conversion like this is not good practice anyway, since a fixed offset cannot account for daylight-saving changes.
- I tried some other alternatives as well, but the three approaches above gave me insight into the data that made the optimisation simpler, though a little tricky.
- When I checked the stats at column level, I noticed that although the table has 600 million records, the number of unique values that actually need converting is only about 1 million.
- We were converting two columns (1,200 million conversions in total). Instead, I collected the unique values from those two columns (only about 1 million conversions) and applied the conversion to them. This took only 20 seconds. 🙂
- Once we had those converted values, I joined them back to the table and picked the required columns (see the sketch after this list). This took 4 minutes.
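A minimal sketch of the final approach, assuming a table `events` with two timestamp columns `event_ts` and `updated_ts` (all names and the source timezone are illustrative, not the actual production schema):

```sql
-- How many values actually need converting? COUNT(DISTINCT ...) gives the
-- exact answer; SHOW COLUMN STATS events shows the estimated NDV per column.
SELECT COUNT(DISTINCT event_ts)   AS distinct_event_ts,
       COUNT(DISTINCT updated_ts) AS distinct_updated_ts
FROM events;

-- Step 1: convert each distinct value exactly once (~1M conversions).
CREATE TABLE ts_lookup STORED AS PARQUET AS
SELECT ts,
       to_utc_timestamp(ts, 'America/New_York') AS ts_utc
FROM (
  SELECT event_ts   AS ts FROM events
  UNION                     -- UNION (without ALL) removes duplicates
  SELECT updated_ts AS ts FROM events
) dedup;

-- Step 2: join the small lookup back to the big table, instead of calling
-- to_utc_timestamp() 1,200 million times.
SELECT e.id,
       l1.ts_utc AS event_ts_utc,
       l2.ts_utc AS updated_ts_utc
FROM events e
LEFT JOIN ts_lookup l1 ON e.event_ts   = l1.ts
LEFT JOIN ts_lookup l2 ON e.updated_ts = l2.ts;
```

A WITH clause or a temporary table would work just as well for the lookup; the point is simply that the expensive function runs once per distinct value instead of once per row.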
Total: ~4 minutes, down from 40 minutes. These numbers are for the daily run; on the monthly run this saves at least an hour.
The takeaway: before optimising queries at the system level, analyse your data first and start the optimisation from the data level.