Understanding SQL Execution Order and Corresponding PySpark Syntax


When writing SQL queries, it is essential to understand the order in which SQL clauses are logically evaluated. Knowing this order helps you write better-optimized queries and makes the transition from SQL to PySpark much smoother. In this blog, we'll walk through the SQL execution order clause by clause and provide the corresponding PySpark syntax for each.

SQL Execution Order and Corresponding PySpark Syntax

Source: https://bytebytego.com/

Below is a clause-by-clause reference outlining the execution order for SQL clauses, a description of each step, and the corresponding PySpark syntax. It will act as a quick reference for anyone transitioning between SQL and PySpark for data processing.



SQL Clause Execution Order

Each numbered entry below gives the SQL clause, a description of what it does, and the corresponding Spark (PySpark) syntax. Entries 1-10 follow the logical order in which a query is evaluated; entries 11-18 cover additional constructs that appear within those steps rather than separate stages. In the snippets, F refers to pyspark.sql.functions, imported as from pyspark.sql import functions as F.



1. FROM / JOIN
   Data is retrieved from the source tables, and any required joins are performed.
   PySpark: df1.join(df2, df1.column == df2.column)

2. ON
   The join condition, specifying how to match records between tables. PySpark has no separate ON clause; the condition is passed directly to join(), either as a boolean expression or as a shared column name (see the sketch below).
   PySpark: df1.join(df2, "column_name")
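To make entries 1 and 2 concrete, here is a minimal sketch. The SparkSession setup is standard boilerplate; the customers and orders DataFrames and their column names are invented purely for illustration, and the later sketches in this post reuse them.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-to-pyspark").getOrCreate()

# Hypothetical source tables for the FROM step
customers = spark.createDataFrame(
    [(1, "Alice"), (2, "Bob")], ["customer_id", "name"])
orders = spark.createDataFrame(
    [(101, 1, 250.0), (102, 2, 90.0), (103, 1, 40.0)],
    ["order_id", "customer_id", "amount"])

# JOIN with an explicit ON condition
joined = orders.join(customers, orders.customer_id == customers.customer_id)

# Shorthand when the join column has the same name on both sides
# (this form also avoids a duplicated customer_id column in the result)
joined = orders.join(customers, "customer_id")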



3. WHERE
   Filters rows based on the given conditions, before any grouping or aggregation.
   PySpark: df.filter("column_name > value") or df.filter(F.col("column_name") > value)

4. GROUP BY
   Groups rows for aggregation.
   PySpark: df.groupBy("column_name").agg(F.sum("other_column").alias("sum_value"))

5. HAVING
   Filters groups based on conditions evaluated after aggregation. PySpark has no HAVING keyword; you simply filter again after agg().
   PySpark: df.groupBy("column_name").agg(F.count("*").alias("cnt")).filter("cnt > 5")
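Entries 3 to 5 chain naturally in the DataFrame API. A sketch that continues with the hypothetical orders DataFrame from the join example above:

from pyspark.sql import functions as F

# WHERE: filter rows before any aggregation
filtered = orders.filter(F.col("amount") > 50)

# GROUP BY + HAVING: aggregate, then filter on the aggregated column
big_spenders = (
    filtered
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_amount"))
    .filter(F.col("total_amount") > 100)  # HAVING equivalent
)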



6. SELECT
   Specifies the columns and expressions to include in the result set.
   PySpark: df.select("column1", "column2") or df.select(F.col("column1"), (F.col("column2") * 2).alias("double_column"))

7. DISTINCT
   Removes duplicate rows after the SELECT step.
   PySpark: df.distinct()
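A short sketch of entries 6 and 7, again on the hypothetical orders DataFrame:

# SELECT two columns plus a computed expression, then drop duplicate rows
result = (
    orders
    .select("customer_id", (F.col("amount") * 2).alias("double_amount"))
    .distinct()
)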



8. UNION / UNION ALL / INTERSECT / EXCEPT
   Combines results from multiple queries, returning their union, intersection, or difference.
   PySpark: df1.union(df2) or df1.unionAll(df2) (both keep duplicates, like SQL UNION ALL; append .distinct() to mimic SQL UNION), df1.intersect(df2) for INTERSECT, and df1.subtract(df2) or df1.exceptAll(df2) for EXCEPT (there is no df.except() because except is a reserved word in Python).
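A sketch of the set operations in entry 8. The two single-column DataFrames are defined here only to make the example runnable:

left = spark.createDataFrame([(1,), (2,), (2,)], ["id"])
right = spark.createDataFrame([(2,), (3,)], ["id"])

all_rows = left.union(right)            # UNION ALL: duplicates kept
deduped = left.union(right).distinct()  # SQL UNION: duplicates removed
common = left.intersect(right)          # INTERSECT
diff = left.subtract(right)             # EXCEPT (distinct rows only)
diff_all = left.exceptAll(right)        # EXCEPT ALL: duplicates kept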



9. ORDER BY
   Sorts the results by one or more columns in ascending or descending order.
   PySpark: df.orderBy("column_name", ascending=True) or df.orderBy(F.col("column_name").desc())

10. LIMIT / OFFSET
    Restricts the number of rows returned (row limitation or pagination).
    PySpark: df.limit(10) limits rows. There is no classic OFFSET in the DataFrame API; newer Spark versions provide df.offset(n), and pagination is otherwise usually emulated with a row_number() window function.
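Entries 9 and 10 combined, continuing with the big_spenders DataFrame from the grouping sketch above:

# ORDER BY total_amount DESC LIMIT 10
top_customers = (
    big_spenders
    .orderBy(F.col("total_amount").desc())
    .limit(10)
)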


11. WITH (CTE)
    Defines Common Table Expressions, temporary named result sets that can be reused within a query.
    PySpark: the DataFrame API has no CTE construct, but Spark SQL does support WITH clauses inside spark.sql(). You can also register a temporary view with df.createOrReplaceTempView("view_name") or simply assign an intermediate DataFrame to a variable.
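Entry 11 in practice: register the hypothetical orders DataFrame as a temporary view, then use a WITH clause from Spark SQL. The view and CTE names are made up for illustration.

orders.createOrReplaceTempView("orders")

cte_result = spark.sql("""
    WITH customer_totals AS (
        SELECT customer_id, SUM(amount) AS total_amount
        FROM orders
        GROUP BY customer_id
    )
    SELECT *
    FROM customer_totals
    WHERE total_amount > 100
""")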



12. CASE / WHEN
    Provides conditional logic within the SELECT or WHERE clause.
    PySpark: df.withColumn("new_column", F.when(F.col("column_name") > 100, "High").otherwise("Low"))
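Entry 12 extends to more than two branches by chaining when() calls; the thresholds and labels below are arbitrary:

orders_labelled = orders.withColumn(
    "order_size",
    F.when(F.col("amount") > 200, "High")
     .when(F.col("amount") > 100, "Medium")
     .otherwise("Low"))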



13. JOIN Types (INNER, LEFT, RIGHT, FULL)
    Specifies the type of join (inner, left outer, right outer, full outer).
    PySpark: df1.join(df2, "column_name", "inner"), df1.join(df2, "column_name", "left"), df1.join(df2, "column_name", "right"), or df1.join(df2, "column_name", "full")
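A quick illustration of entry 13 with the hypothetical orders and customers DataFrames: a left join keeps every row from the left side and fills unmatched columns from the right with nulls, while an inner join keeps only matching rows.

# Every order is kept, even if it has no matching customer
left_joined = orders.join(customers, "customer_id", "left")

# Only orders with a matching customer are kept
inner_joined = orders.join(customers, "customer_id", "inner")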



14. IN / NOT IN
    Filters rows based on whether a value matches a set of values.
    PySpark: df.filter(F.col("column_name").isin(value1, value2, value3)) or df.filter(~F.col("column_name").isin(value1, value2))

15. LIKE
    Pattern matching on strings.
    PySpark: df.filter(F.col("column_name").like("pattern%")) for SQL LIKE patterns (% and _ wildcards), or df.filter(F.col("column_name").rlike("pattern")) for regular expressions.

16. IS NULL / IS NOT NULL
    Checks for NULL values in columns.
    PySpark: df.filter(F.col("column_name").isNull()) or df.filter(F.col("column_name").isNotNull())
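Entries 14 to 16 can be combined into a single filter chain. A sketch on the hypothetical customers DataFrame:

selected_customers = (
    customers
    .filter(F.col("name").isin("Alice", "Bob"))  # IN
    .filter(F.col("name").like("A%"))            # LIKE (SQL pattern)
    .filter(F.col("name").isNotNull())           # IS NOT NULL
)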



17. ROLLUP / CUBE / GROUPING SETS
    Multi-level aggregation: ROLLUP adds hierarchical subtotals, while CUBE adds subtotals for every combination of the grouping columns.
    PySpark: df.rollup("column1", "column2").agg(F.sum("value")) or df.cube("column1", "column2").agg(F.avg("value")) (rollup and cube are called on the DataFrame itself, not chained after groupBy).
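Entry 17 with two grouping columns; the sales DataFrame below is invented for illustration:

sales = spark.createDataFrame(
    [("EU", "A", 10.0), ("EU", "B", 20.0), ("US", "A", 5.0)],
    ["region", "product", "value"])

# ROLLUP: totals per (region, product), per region, plus a grand total
rolled = sales.rollup("region", "product").agg(F.sum("value").alias("total"))

# CUBE: every combination of the grouping columns, including product-only totals
cubed = sales.cube("region", "product").agg(F.avg("value").alias("avg_value"))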



18. WINDOW Functions (OVER)
    Performs calculations over a window of rows, such as running totals or rankings.
    PySpark: df.withColumn("running_total", F.sum("value").over(Window.orderBy("date"))), with Window imported from pyspark.sql.window.
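Entry 18 with the required imports. The daily DataFrame and its columns are made up for illustration; note that a window ordered without partitionBy() moves all rows into a single partition, so add a partition column whenever the data allows it.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

daily = spark.createDataFrame(
    [("2024-01-01", 10.0), ("2024-01-02", 20.0), ("2024-01-03", 5.0)],
    ["date", "value"])

# Running total over rows ordered by date
w = (Window.orderBy("date")
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))
running = daily.withColumn("running_total", F.sum("value").over(w))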

Conclusion

This reference provides an easy-to-follow guide to the SQL execution order and its corresponding PySpark syntax. Understanding this flow allows you to translate SQL queries into equivalent PySpark operations with confidence.

It also serves as a quick lookup for common SQL operations such as joins, filtering, grouping, aggregation, set operations, sorting, and window functions, along with their direct equivalents in PySpark.

This should help beginners and experienced users alike improve their knowledge and productivity when working with large-scale data using Apache Spark.

Happy coding!