Nikhil Aggarwal, Author at DataForGeeks

What It Really Takes to Run Snowpipe in Production at Scale – A Comprehensive Guide

May 30, 2025 | May 28, 2025 by Nikhil Aggarwal

We adopted a practical Medallion-style approach to structure our data flows, segmenting them into Bronze, Silver, and Gold layers. As part of this redesign, we needed to optimize how curated data was exported to Snowflake. That’s when we hit performance issues with external tables. I know the common suggestion is to use the COPY command, but …

Apache Iceberg: The Data Lake Breakthrough That’s Reshaping the Big Data Landscape

May 28, 2025 | May 21, 2025 by Nikhil Aggarwal

By the end of this read, you’ll understand why Apache Iceberg is not just another open table format: it’s the seismic shift that’s redefining the role of Delta Lake in the modern data ecosystem and transforming how the world thinks about data lakes.
From Data Lakes to Data Icebergs: A New Era Begins
Over …

The Medallion Masterstroke: How Databricks Rewired the Data World One Bronze Layer at a Time

May 29, 2025 | January 20, 2025 by Nikhil Aggarwal

The Era of Chaos – and Snowflake’s Rise
Back in 2017, most of us were drowning in messy data. Files were everywhere in S3 buckets, Hadoop jobs kept failing at the worst times, and analysts? They were always chasing clean data that never seemed to arrive when needed. It was frustrating, and honestly, it felt …

Mastering Python Setup on macOS: Bye Conda, Hello pyenv + Fancy iTerm2 Terminal

May 20, 2025 | October 9, 2024 by Nikhil Aggarwal

Tired of messy Python setups? Ever screamed at your terminal? Been there, done that, deleted Anaconda. Let me show you how I set up a clean, beautiful, and powerful Python development environment on my Mac. It’s light, customizable, and perfect for devs who love a good-looking terminal and tight control over Python versions. 🤓 Why …

Python Data Structures Simplified: List, Tuple, Dict, Set, Frozenset & More

May 2, 2025 | May 2, 2024 by Nikhil Aggarwal

Python offers a rich set of built-in and extended data structures to efficiently manage and process data. In this blog, we’ll take a deep dive into the essential ones: List, Tuple, Dictionary (Dict), Set, and Frozenset, and also explore some powerful structures from the collections and dataclasses modules. We’ll cover their properties, use cases, constructors, and how to convert between them using intuitive examples. Note: Since Python 3.7, dictionaries …
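To give a flavour of the conversions mentioned in the excerpt above, here is a minimal sketch using only the standard library. The sample values and the `Point` class are made up for illustration and are not taken from the full post.

```python
from collections import Counter
from dataclasses import dataclass

items = ["a", "b", "a", "c"]  # hypothetical sample data

as_tuple = tuple(items)          # immutable sequence, keeps order and duplicates
as_set = set(items)              # mutable, unique elements only
as_frozenset = frozenset(items)  # immutable and hashable, usable as a dict key
counts = Counter(items)          # collections: tally of each element
as_dict = dict(zip(["x", "y"], [1, 2]))  # dict built from paired keys and values


@dataclass
class Point:  # hypothetical dataclass, only for this example
    x: int
    y: int


point = Point(**as_dict)  # unpack the dict into the dataclass fields
lookup = {as_frozenset: "seen"}  # a frozenset works as a key, a plain set would not
```

Each constructor accepts any iterable, which is why moving between these types is usually a one-liner.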

Understanding SQL Execution Order and Corresponding PySpark Syntax

May 29, 2025 | September 2, 2023 by Nikhil Aggarwal

When writing SQL queries, it is essential to understand the order in which SQL clauses are executed. This helps in writing optimized queries, especially when transitioning from SQL to PySpark. In this blog, we’ll walk you through the SQL execution order and the SQL clauses, and provide their corresponding PySpark syntax.
SQL Execution Order and Corresponding …
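As a rough illustration of the idea in the excerpt above, the sketch below walks through the usual logical order of SQL clauses (FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, LIMIT) in plain Python instead of PySpark, so it runs without a cluster. The rows and the query are invented for this example and are not from the full post.

```python
from collections import defaultdict

# Hypothetical data standing in for a table called "rows".
rows = [
    {"dept": "eng", "salary": 120},
    {"dept": "eng", "salary": 100},
    {"dept": "ops", "salary": 90},
    {"dept": "ops", "salary": 40},
    {"dept": "hr", "salary": 70},
]

# Query being mimicked:
#   SELECT dept, SUM(salary) AS total FROM rows WHERE salary > 50
#   GROUP BY dept HAVING SUM(salary) > 80 ORDER BY total DESC LIMIT 2

source = rows                                               # 1. FROM
filtered = [r for r in source if r["salary"] > 50]          # 2. WHERE
groups = defaultdict(list)                                  # 3. GROUP BY
for r in filtered:
    groups[r["dept"]].append(r["salary"])
kept = {d: s for d, s in groups.items() if sum(s) > 80}     # 4. HAVING
projected = [{"dept": d, "total": sum(s)} for d, s in kept.items()]  # 5. SELECT
ordered = sorted(projected, key=lambda r: r["total"], reverse=True)  # 6. ORDER BY
result = ordered[:2]                                        # 7. LIMIT
```

Note that the row filter runs before grouping while the HAVING filter runs after it, which is why HAVING can reference an aggregate and WHERE cannot.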

Snowflake – Performance Tuning and Best Practices

May 15, 2022 | May 14, 2022 by Nikhil Aggarwal

Note: This article compiles multiple performance tuning methodologies for Snowflake. Some text and images in it are drawn from various articles and books, details of which are captured under “References”.
Introduction to Snowflake
Snowflake is a SaaS-based Data Warehouse platform built over AWS (and other clouds) infrastructure. One of the …

Apache Spark – Performance Tuning and Best Practices

May 29, 2025 | May 4, 2022 by Nikhil Aggarwal

Note: This article compiles multiple performance tuning methodologies for Apache Spark. Text and images in it are drawn from various articles and books, details of which are captured under “References”.
Tweak Configurations
Viewing and Setting Apache Spark Configurations
There are 4 ways of doing it:
Way-1: Using the $SPARK_HOME directory (Configuration changes in …

Data Serialisation – Avro vs Protocol Buffers

May 29, 2025 | March 23, 2022 by Nikhil Aggarwal

Background
File Formats Evolution
Why not use CSV/XML/JSON?
Repeated or no meta information.
Files are not splittable, so they cannot be used in a map-reduce environment.
Missing or limited schema definition and evolution support.
“JsonSchema” can be used to maintain the schema separately for JSON, but the data may still require transformation based on a schema, so why not consider Avro/Proto? …

Count(*) – Explaining different behaviour in Joins

April 20, 2022 | February 4, 2022 by Nikhil Aggarwal

Observations:
Count(1) or Count(*): this is never expanded into individual columns, so it works correctly on the complete data. Count(1) is more optimized than Count(*).
Count(source.*), where source is the “Left table” of the “Left Outer Join”: this is evaluated as Count(source.col1, source.col2, …, source.colN). So, if any column has NULL, then the complete row …
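The NULL-sensitivity described in the excerpt above can be seen in a small experiment. This is only a sketch run against an in-memory SQLite database, which is an assumption on my part: SQLite does not accept the Count(source.*) form, so the example counts individual columns instead, and the table names and rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source (id INTEGER, name TEXT);   -- hypothetical left table
    CREATE TABLE target (id INTEGER, city TEXT);   -- hypothetical right table
    INSERT INTO source VALUES (1, 'a'), (2, NULL), (3, 'c');
    INSERT INTO target VALUES (1, 'x');
    """
)

# COUNT(*) and COUNT(1) count every joined row;
# COUNT(column) skips rows where that column is NULL.
counts = conn.execute(
    """
    SELECT COUNT(*), COUNT(1), COUNT(s.name), COUNT(t.id)
    FROM source s
    LEFT OUTER JOIN target t ON s.id = t.id
    """
).fetchone()
conn.close()

total_rows, total_ones, non_null_names, matched_rows = counts
```

Here the join yields three rows, one source row has a NULL name, and only one source row finds a match in target, so the four counts differ even though they come from the same result set.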