What It Really Takes to Run Snowpipe in Production at Scale – A Comprehensive Guide

We adopted a practical Medallion-style approach to structure our data flows, segmenting them into Bronze, Silver, and Gold layers. As part of this redesign, we needed to optimize how curated data was exported to Snowflake. That’s when we hit performance issues with external tables. I know the common suggestion is to use the COPY command, but … Read more

The Medallion Masterstroke: How Databricks Rewired the Data World One Bronze Layer at a Time

The Era of Chaos – and Snowflake’s Rise: Back in 2017, most of us were drowning in messy data. Files were everywhere in S3 buckets, Hadoop jobs kept failing at the worst times, and analysts? They were always chasing clean data that never seemed to arrive when needed. It was frustrating, and honestly, it felt … Read more

Mastering Python Setup on macOS: Bye Conda, Hello pyenv + Fancy iTerm2 Terminal

Tired of messy Python setups? Ever screamed at your terminal? Been there, done that, deleted Anaconda. Let me show you how I set up a clean, beautiful, and powerful Python development environment on my Mac. It’s light, customizable, and perfect for devs who love a good-looking terminal and tight control over Python versions. 🤓 Why … Read more

Python Data Structures Simplified: List, Tuple, Dict, Set, Frozenset & More

Python offers a rich set of built-in and extended data structures to efficiently manage and process data. In this blog, we’ll dive deep into the essential ones: List, Tuple, Dictionary (Dict), Set, and Frozenset, and also explore some powerful structures from the collections and dataclasses modules. We’ll cover their properties, use cases, constructors, and how to convert between them using intuitive examples. Note: Since Python 3.7, dictionaries … Read more

Understanding SQL Execution Order and Corresponding PySpark Syntax

When writing SQL queries, it is essential to understand the order in which SQL clauses are executed. This helps in writing optimized queries, especially when transitioning from SQL to PySpark. In this blog, we’ll walk you through the SQL execution order and its clauses, and provide the corresponding PySpark syntax. SQL Execution Order and Corresponding … Read more

Snowflake – Performance Tuning and Best Practices

Note: This article is a compilation of multiple performance tuning methodologies in Snowflake. Some text and images in the following article have been drawn from various interesting articles and books, details of which are captured under “References”. Introduction to Snowflake: Snowflake is a SaaS-based Data Warehouse platform built on AWS (and other clouds) infrastructure. One of the … Read more

Apache Spark – Performance Tuning and Best Practices

Note: This article is a compilation of multiple performance tuning methodologies in Apache Spark. Text and images in the following article have been drawn from various interesting articles and books, details of which are captured under “References”. Tweak Configurations: Viewing and Setting Apache Spark Configurations. 4 ways of doing it: Way-1: Using $SPARK_HOME directory (Configuration changes in … Read more

Data Serialisation – Avro vs Protocol Buffers

Background: File Formats Evolution. Why not use CSV/XML/JSON? Repeated or no meta information. Files are not splittable, so they cannot be used in a map-reduce environment. Missing or limited schema definition and evolution support. Can leverage “JsonSchema” to maintain schema separately for JSON. It may still require transformation based on a schema, so why not consider Avro/Proto? … Read more

Count(*) – Explaining different behaviour in Joins

Observations: Count(1) or Count(*) – this is never expanded into individual columns, so it works correctly on the complete data. Count(1) is more optimized than Count(*). Count(source.*) – source represents the “Left table” of a “Left Outer Join”: this will be evaluated as Count(source.col1, source.col2, …, source.colN). So, if any column has NULL, then the complete row … Read more