The Medallion Masterstroke: How Databricks Rewired the Data World One Bronze Layer at a Time

The Era of Chaos – and Snowflake’s Rise

Back in 2017, most of us were drowning in messy data. Files were everywhere in S3 buckets, Hadoop jobs kept failing at the worst times, and analysts? They were always chasing clean data that never seemed to arrive when needed. It was frustrating, and honestly, it felt …

Mastering Python Setup on macOS: Bye Conda, Hello pyenv + Fancy iTerm2 Terminal

Tired of messy Python setups? Ever screamed at your terminal? Been there, done that, deleted Anaconda. Let me show you how I set up a clean, beautiful, and powerful Python development environment on my Mac. It’s light, customizable, and perfect for devs who love a good-looking terminal and tight control over Python versions. 🤓 Why …

Python Data Structures Simplified: List, Tuple, Dict, Set, Frozenset & More

Python offers a rich set of built-in and extended data structures to efficiently manage and process data. In this blog, we’ll dive deep into the essential ones: List, Tuple, Dictionary (Dict), Set, and Frozenset, and also explore some powerful structures from the collections and dataclasses modules. We’ll cover their properties, use cases, constructors, and how to convert between them using intuitive examples. Note: Since Python 3.7, dictionaries …
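As a rough illustration of the conversions mentioned above, here is a minimal sketch. The sample data and the `Language` dataclass are hypothetical and are not taken from the original post.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical sample data: a list is ordered, mutable, and allows duplicates.
langs = ["python", "sql", "python", "scala"]

as_tuple = tuple(langs)          # ordered, immutable
as_set = set(langs)              # unique items, mutable, no positional order
as_frozenset = frozenset(langs)  # unique items, immutable, hashable

# dict.fromkeys keeps the first occurrence of each key in insertion order,
# so it can de-duplicate a list without shuffling it.
unique_in_order = list(dict.fromkeys(langs))

# Build a dict from a list, then turn it back into a list of (key, value) pairs.
name_lengths = {name: len(name) for name in unique_in_order}
pairs = list(name_lengths.items())

# Counter (from collections) is a dict subclass that tallies hashable items.
counts = Counter(langs)


@dataclass(frozen=True)
class Language:
    """Hypothetical record type; frozen=True makes instances immutable and hashable."""
    name: str
    uses: int


records = [Language(name, uses) for name, uses in counts.items()]
```

A frozenset or a frozen dataclass instance can be used as a dictionary key or set member, which a plain list or set cannot.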

Understanding SQL Execution Order and Corresponding PySpark Syntax

When writing SQL queries, it is essential to understand the order in which SQL clauses are executed. This helps in writing optimized queries, especially when transitioning from SQL to PySpark. In this blog, we’ll walk through the SQL execution order and its clauses, and provide the corresponding PySpark syntax.

SQL Execution Order and Corresponding …
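To make the idea of execution order concrete, here is a small plain-Python sketch that applies the commonly taught logical order (FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, LIMIT) one step at a time. The `orders` rows and the thresholds are made up for illustration. This is not PySpark code and not the post's own example.

```python
from collections import defaultdict

# FROM: hypothetical source rows
orders = [
    {"region": "east", "amount": 120},
    {"region": "west", "amount": 40},
    {"region": "east", "amount": 80},
    {"region": "west", "amount": 15},
    {"region": "north", "amount": 300},
]

# WHERE amount > 20: row-level filter, applied before any grouping
filtered = [row for row in orders if row["amount"] > 20]

# GROUP BY region, aggregating SUM(amount)
totals = defaultdict(int)
for row in filtered:
    totals[row["region"]] += row["amount"]

# HAVING SUM(amount) > 50: group-level filter, applied after aggregation
kept = {region: total for region, total in totals.items() if total > 50}

# SELECT region, total
selected = [(region, total) for region, total in kept.items()]

# ORDER BY total DESC
ordered = sorted(selected, key=lambda item: item[1], reverse=True)

# LIMIT 2
result = ordered[:2]
```

Because WHERE runs before GROUP BY, the 15-unit west order never reaches the aggregation, and the west group (total 40) is then dropped by HAVING.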

Apache Spark – Performance Tuning and Best Practices

Apache Spark has revolutionized the way we process large-scale data, delivering unparalleled speed, scalability, and flexibility. But as many engineers discover, achieving optimal performance in Spark is far from automatic. Your job runs, but takes longer than expected. The cluster scales, but the costs rise disproportionately. Memory errors appear out of nowhere. …

Data Serialisation – Avro vs Protocol Buffers

Background

File Formats Evolution

Why not use CSV/XML/JSON?

- Repeated or no meta information.
- Files are not splittable, so they cannot be used in a map-reduce environment.
- Missing or limited schema definition and evolution support.
- Can leverage “JsonSchema” to maintain the schema separately for JSON. It may still require transformation based on a schema, so why not consider Avro/Proto? …

Count(*) – Explaining different behaviour in Joins

Observations:

- Count(1) or Count(*): This is never expanded on each column individually, so it works fine on the complete data. Count(1) is more optimized than Count(*).
- Count(source.*), where source represents the “Left table” of the “Left Outer Join”: This will be evaluated as Count(source.col1, source.col2, …, source.colN). So, if any column has NULL, then the complete row …
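The difference between the two forms can be sketched in plain Python. This assumes Spark SQL's documented behaviour, where a count over several expressions counts only the rows in which all of them are non-null. The joined rows and the helpers `count_star` and `count_columns` are hypothetical and only simulate the semantics.

```python
# Hypothetical result of a LEFT OUTER JOIN, as (source.id, source.name, target.id).
joined_rows = [
    (1, "a", 1),
    (2, None, 2),    # a source column holds NULL
    (3, "c", None),  # no matching row on the right side
]


def count_star(rows):
    """Simulates Count(*) / Count(1): every row counts."""
    return len(rows)


def count_columns(rows, column_indexes):
    """Simulates Count(c1, ..., cN): a row counts only if all listed columns are non-null."""
    return sum(
        1 for row in rows
        if all(row[i] is not None for i in column_indexes)
    )


total_rows = count_star(joined_rows)
source_rows = count_columns(joined_rows, [0, 1])  # like Count(source.*)
target_rows = count_columns(joined_rows, [2])     # like Count(target.id)
```

Here `total_rows` is 3 while `source_rows` is 2, because the row with a NULL in `source.name` is skipped even though it exists in the left table.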

Cost and Performance Analysis : CSV and Parquet Format

I was comparing the cost of querying CSV files versus Parquet files. Interestingly, for similar queries, data scanning with the Parquet format cost 99% less than with the CSV format.

Queries (mentioned only for Parquet) | CSV (11.32 GB) Run Time (in sec) | CSV (11.32 GB) Data Scanned (in GB) | PARQUET (4.1 GB) Run Time (in sec) | PARQUET (4.1 GB) Data Scanned (in GB)
…

Understand the FOR loop

This post was originally published on my first blog, learntheprogramming.blogspot.com.

A “for” loop allows code to be executed repeatedly and is classified as an iteration statement. Unlike many other kinds of loops, such as the while loop, the for loop is often distinguished by an explicit loop counter or loop variable. This allows the body …
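A minimal Python sketch of the contrast described above: the for loop manages its loop variable itself, while the equivalent while loop has to initialise and advance a counter by hand. The variable names and the range 1 to 5 are chosen only for illustration.

```python
# for loop: the loop variable `counter` is stepped through 1..5 automatically.
for_total = 0
visited = []
for counter in range(1, 6):
    visited.append(counter)
    for_total += counter

# Equivalent while loop: the counter must be initialised, tested,
# and incremented explicitly.
while_total = 0
i = 1
while i <= 5:
    while_total += i
    i += 1
```

Both loops add up the same five values. Forgetting the `i += 1` line in the while version would make it loop forever, which is the kind of mistake the for loop's built-in counter avoids.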