optimisation Archives - DataForGeeks

Snowflake – Performance Tuning and Best Practices

June 10, 2025May 14, 2022 by Nikhil Aggarwal

Snowflake Performance Tuning with charts and best practices visual illustration

Snowflake’s cloud-native architecture makes it incredibly easy to get started — but running it efficiently at scale is a whole different game. If you’ve ever faced slow queries, ballooning credit consumption, or unpredictable performance, you’re not alone. Tuning Snowflake workloads requires more than just adjusting warehouse sizes — it involves understanding how Snowflake stores data, … Read more

Apache Spark – Performance Tuning and Best Practices

June 10, 2025May 4, 2022 by Nikhil Aggarwal

Apache Spark has revolutionized the way we process large-scale data — delivering unparalleled speed, scalability, and flexibility. But as many engineers discover, achieving optimal performance in Spark is far from automatic. Your job runs — but takes longer than expected. The cluster scales — but the costs rise disproportionately. Memory errors appear out of nowhere. … Read more

Count(*) – Explaining different behaviour in Joins

April 20, 2022February 4, 2022 by Nikhil Aggarwal

Observations : Count(1) or Count(*) – This is never expanded on each column individually so will work perfectly fine on complete data. Count(1) is more optimized then Count(*) Count(source.*) – source represents “Left table” of “Left Outer Join”: This will be evaluated as Count(source.col1, source.col2, …. source.colN ) So, if any column has NULL, then the complete row … Read more

Cost and Performance Analysis : CSV and Parquet Format

April 20, 2022May 27, 2020 by Nikhil Aggarwal

I was doing some cost comparison of using CSV files vs Parquet File. Interestingly, when using Parquet format, data scanning for similar queries, cost 99% less as compared to CSV format. Queries ( Mentioned only for Parquet) CSV ( 11.32 GB )Run Time (in sec) CSV ( 11.32 GB )DataScanned (in GB) PARQUET ( 4.1 GB )Run Time (in sec) PARQUET ( 4.1 GB )DataScanned (in GB) … Read more

Impala – Create Table AS Select * FROM Table – is SLOW

March 24, 2022July 22, 2016 by Nikhil Aggarwal

Below query seems like the simplest way to create a replica of table. But simplicity comes with some cost as well. Above query will : NOT create partitions if there are any on TABLE_NAME_2 run very slow Instead of above we should follow following 2 way approach : CREATE TABLE TABLE_NAME Like TABLE_NAME_2; — … Read more

HDFS – Data Movement across clusters

March 24, 2022July 21, 2016 by Nikhil Aggarwal

You can move data in HDFS cluster using distcp command. distcp uses 10 mappers by default to bring data from source system. While doing data movement I encountered a problem in which data movement was failing because of checksum mismatch. If any block mismatch in the checksum then the complete data block was getting discarded. Checksum is … Read more

Impala – Use Incremental stats instead of Full Table stats

March 24, 2022July 21, 2016 by Nikhil Aggarwal

If you have a table which is partitioned on a column then doingCompute stats TABLE_NAMEwill execute on all partitions. Internally compute stats run NDV function on each column to get numbers. However NDV function works faster then other count(COLUMN), but it will run for each partition which may be irrelevant when you are working/updating/modifying values … Read more

Impala – Optimise query when using to_utc_timestamp() function

May 29, 2025July 21, 2016 by Nikhil Aggarwal

From 40 minutes to just 4 minutes Impala to_utc_timestamp() function is used to convert date/timestamp timezone to UTC. But it works very slow. If you have less data in table even then you can easily notice its slow performance. I faced a similar issue and noticed it was taking around 40 minutes alone to complete … Read more

Impala – Optimisation at partition level

March 27, 2022June 21, 2016 by Nikhil Aggarwal

We all know that to optimise our queries these 3 strategies are like most common : Partitioned table Bucketing Collecting Stats But sometimes a simple query will run on ALL partitions instead of one. You may notice your query should work on one partition but it will run on all partitions.Let me show you an … Read more

Sqoop – Handle NULL values

March 24, 2022June 28, 2015 by Nikhil Aggarwal

By default Sqoop import NULL as null, if you want to change this default configuration you can use following arguments. While importing data : –null-string –null-non-string While exporting data : –input-null-string –input-null-non-string Check this example for more clarification : In above example : –null-string argument represents what should be writtern in HDFS whenever a NULL is identified in … Read more