We all know that to optimise our queries these 3 strategies are like most common :
- Partitioned table
- Bucketing
- Collecting Stats
But sometimes a simple query will run on ALL partitions instead of one. You may notice your query should work on one partition but it will run on all partitions.
Let me show you an example :
select count(*) FROM Partitioned_table PT JOIN Month_id MI ON PT.paritioned_id = MI.id WHERE MI.year_month = 201606;
Month_id id | year_month 1 | 201606 Partitioned_table partitioned_id | other columns 1 | .......
In the above we have one reference table Month_id, this table contains id for a year_month. And in Partitioned_table we have partitions defined on column partitioned_id column.
This looks simple in terms that a join will be evaluated and then where clause will filter out the records as per predicate.
But I noticed in Impala, it first applies predicate and then makes join. So, with many alternatives I tried, it never ran on a single partition. Instead, when I added a filter on partitioned_id it worked on 1 partition.
Queries stats were as follows :
Execution time = ~50 minutes
Total partitions = 11/11
Total HDFS Scan = 20TB
Instead of deriving the value of the partitioned_id column from table, I passed the partitioned value to the query and ran it. This brought the time to just 5.2 minutes.
New Query stats :
Execution time = 5.2 minutes
Partitions scan = 1/11
Total HDFS Scan = 1TB
I read somewhere that there is a concept of Predicate Propagation which was not supported in Impala version. But your suggestions are welcome.