DataForGeeks

What It Really Takes to Run Snowflake’s Snowpipe in Production at Scale – A Comprehensive Guide

Updated June 10, 2025 (published May 28, 2025) by Nikhil Aggarwal

We adopted a practical Medallion-style approach to structure our data flows, segmenting them into Bronze, Silver, and Gold layers. As part of this redesign, we needed to optimize how curated data was exported to Snowflake. That’s when we hit performance issues with external tables. I know the common suggestion is to use the … Read more

Apache Iceberg: The Data Lake Breakthrough That’s Reshaping the Big Data Landscape

Updated June 9, 2025 (published May 21, 2025) by Nikhil Aggarwal

By the end of this read, you’ll understand why Apache Iceberg is not just another open table format: it’s the seismic shift that’s redefining the role of Delta Lake in the modern data ecosystem and transforming how the world thinks about data lakes. From Data Lakes to Data Icebergs: A New Era Begins Over … Read more

The Medallion Masterstroke: How Databricks Rewired the Data World One Bronze Layer at a Time

Updated June 9, 2025 (published January 20, 2025) by Nikhil Aggarwal

The Era of Chaos – and Snowflake’s Rise: Back in 2017, most of us were drowning in messy data. Files were everywhere in S3 buckets, Hadoop jobs kept failing at the worst times, and analysts? They were always chasing clean data that never seemed to arrive when needed. It was frustrating, and honestly, it felt … Read more

Mastering Python Setup on macOS: Bye Conda, Hello pyenv + Fancy iTerm2 Terminal

Updated June 10, 2025 (published October 9, 2024) by Nikhil Aggarwal

[Image: Python development setup on macOS using pyenv and iTerm2 terminal, as described in the blog on mastering Python setup without Conda.]

Tired of messy Python setups? Ever screamed at your terminal? Been there, done that, deleted Anaconda. Let me show you how I set up a clean, beautiful, and powerful Python development environment on my Mac. It’s light, customizable, and perfect for devs who love a good-looking terminal and tight control over Python versions. 🤓 Why … Read more

Python Data Structures Simplified: List, Tuple, Dict, Set, Frozenset & More

Updated June 10, 2025 (published May 2, 2024) by Nikhil Aggarwal

[Image: Colorful icons representing Python data structures (List, Tuple, Dict, Set), from the blog Python Data Structures Simplified.]

Python offers a rich set of built-in and extended data structures to efficiently manage and process data. In this blog, we’ll dive deep into the essential ones: List, Tuple, Dictionary (Dict), Set, and Frozenset, and also explore some powerful structures from the collections and dataclasses modules. We’ll cover their properties, use-cases, constructors, and how to convert between them using intuitive examples. Note: Since Python 3.7, dictionaries … Read more
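As a quick taste of the conversions mentioned above, here is a minimal sketch using only built-in constructors. The variable names and sample values are made up for illustration and are not taken from the full post.

```python
# Start from a list that contains a duplicate value.
items = [3, 1, 3, 2]

as_tuple = tuple(items)          # immutable, keeps order and duplicates
as_set = set(items)              # mutable, drops duplicates, unordered
as_frozenset = frozenset(items)  # immutable and hashable version of a set

# Build a dict from two parallel sequences, then go back to a list of pairs.
as_dict = dict(zip(["a", "b", "c"], [1, 2, 3]))
pairs = list(as_dict.items())

# A frozenset is hashable, so it can be used as a dict key (a plain set cannot).
lookup = {as_frozenset: "seen"}

# Sorting a set gives a deterministic list again.
unique_sorted = sorted(as_set)
```

The same pattern works in every direction: pass any iterable to the constructor of the type you want.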

Understanding SQL Execution Order and Corresponding PySpark Syntax

Updated June 10, 2025 (published September 2, 2023) by Nikhil Aggarwal

When writing SQL queries, it is essential to understand the order in which SQL clauses are executed. This helps in writing optimized queries, especially when transitioning from SQL to PySpark. In this blog, we’ll walk you through the SQL execution order, the SQL clauses, and provide their corresponding PySpark syntax. SQL Execution Order and Corresponding … Read more
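To make the idea of logical execution order concrete, here is a small sketch that annotates each clause with its position in the standard logical order (FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, LIMIT). It uses Python's built-in sqlite3 module so it runs without a Spark cluster. The `orders` table and its rows are hypothetical, and the PySpark chain in the trailing comment is only an assumed equivalent, not the mapping from the full post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("a", 10), ("a", 200), ("b", 50), ("b", 60), ("c", 5)],
)

# The written order differs from the logical evaluation order (numbered below).
query = """
SELECT customer, SUM(amount) AS total   -- 5. SELECT
FROM orders                             -- 1. FROM
WHERE amount >= 10                      -- 2. WHERE  (filters rows before grouping)
GROUP BY customer                       -- 3. GROUP BY
HAVING SUM(amount) > 100                -- 4. HAVING (filters groups after grouping)
ORDER BY total DESC                     -- 6. ORDER BY
LIMIT 10                                -- 7. LIMIT
"""
rows = conn.execute(query).fetchall()

# A rough PySpark equivalent, written in the same logical order (assumes a
# DataFrame `df` with these columns and `from pyspark.sql import functions as F`):
#   df.filter(F.col("amount") >= 10)
#     .groupBy("customer").agg(F.sum("amount").alias("total"))
#     .filter(F.col("total") > 100)
#     .orderBy(F.col("total").desc())
#     .limit(10)
```

Customer `c` is removed by WHERE before any grouping happens, which is why it never reaches the HAVING step.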

Snowflake – Performance Tuning and Best Practices

Updated June 10, 2025 (published May 14, 2022) by Nikhil Aggarwal

[Image: Snowflake Performance Tuning with charts and best practices visual illustration]

Snowflake’s cloud-native architecture makes it incredibly easy to get started — but running it efficiently at scale is a whole different game. If you’ve ever faced slow queries, ballooning credit consumption, or unpredictable performance, you’re not alone. Tuning Snowflake workloads requires more than just adjusting warehouse sizes — it involves understanding how Snowflake stores data, … Read more

Apache Spark – Performance Tuning and Best Practices

Updated June 10, 2025 (published May 4, 2022) by Nikhil Aggarwal

Apache Spark has revolutionized the way we process large-scale data — delivering unparalleled speed, scalability, and flexibility. But as many engineers discover, achieving optimal performance in Spark is far from automatic. Your job runs — but takes longer than expected. The cluster scales — but the costs rise disproportionately. Memory errors appear out of nowhere. … Read more

Data Serialisation – Avro vs Protocol Buffers

Updated June 10, 2025 (published March 23, 2022) by Nikhil Aggarwal

[Image: Visual comparison of Avro vs Protocol Buffers for data serialisation, with arrows representing data flow for each format]

Background: File Formats Evolution. Why not use CSV/XML/JSON? Meta information is repeated or absent. Files are not splittable, so they cannot be used in a map-reduce environment. Schema definition and evolution support is missing or limited. “JsonSchema” can be used to maintain a schema separately for JSON, but it may still require transformation based on a schema, so why not consider Avro/Proto? … Read more
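The "repeated meta information" point can be illustrated with a rough standard-library sketch. This is not Avro or Protocol Buffers: it only mimics the general idea of keeping a schema once and storing values alone, using `struct`. The `schema` layout and the sample records are assumptions made for the example.

```python
import json
import struct

records = [{"id": i, "score": i * 1.5} for i in range(100)]

# JSON repeats the field names "id" and "score" inside every single record.
json_bytes = json.dumps(records).encode("utf-8")

# Hypothetical mini-schema, stored once and separately from the data:
# field name plus struct type code ("i" = 4-byte int, "d" = 8-byte double).
schema = [("id", "i"), ("score", "d")]
fmt = "<" + "".join(code for _, code in schema)  # little-endian, no padding
record_size = struct.calcsize(fmt)

# Each encoded record now carries values only, with no field names.
packed = b"".join(struct.pack(fmt, r["id"], r["score"]) for r in records)

# Decoding needs the schema, which is the trade-off of schema-based formats.
names = [name for name, _ in schema]
decoded = [
    dict(zip(names, struct.unpack_from(fmt, packed, offset)))
    for offset in range(0, len(packed), record_size)
]
```

Comparing `len(json_bytes)` with `len(packed)` shows how much of the JSON payload is field names and punctuation rather than data.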

Count(*) – Explaining different behaviour in Joins

Updated April 20, 2022 (published February 4, 2022) by Nikhil Aggarwal

Observations: Count(1) or Count(*): this is never expanded on each column individually, so it works correctly on the complete data. Count(1) is more optimized than Count(*). Count(source.*), where source represents the “Left table” of a “Left Outer Join”: this will be evaluated as Count(source.col1, source.col2, …, source.colN). So, if any column has NULL, then the complete row … Read more
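The NULL-sensitivity behind these observations can be reproduced with a small sketch in Python's built-in sqlite3 module. SQLite does not support the Count(source.*) form, so this only contrasts Count(*) and Count(1) with Count on a single column. The `source` and `target` tables and their rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source (id INTEGER, name TEXT);
    CREATE TABLE target (id INTEGER, city TEXT);
    INSERT INTO source VALUES (1, 'a'), (2, NULL), (3, 'c');
    INSERT INTO target VALUES (1, 'x'), (2, 'y');
    """
)

# COUNT(*) and COUNT(1) count every joined row.
# COUNT(column) skips rows where that column is NULL, whether the NULL comes
# from the data itself (source.name) or from an unmatched join (target.id).
counts = conn.execute(
    """
    SELECT COUNT(*), COUNT(1), COUNT(source.name), COUNT(target.id)
    FROM source
    LEFT OUTER JOIN target ON source.id = target.id
    """
).fetchone()
```

The join produces three rows; one has a NULL `source.name` and one has no match in `target`, so each column-level count drops one row.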