Apache Spark – Performance Tuning and Best Practices

Note: This article is a compilation effort of multiple performance tuning methodologies in Apache Spark. Text/Images in following article has been referred from various interesting articles and book, details of which are captured under “References”. Tweak Configurations Viewing and Setting Apache Spark Configurations 4 ways of doing it : Way-1:Using $SPARK_HOME directory (Configuration changes in … Read more

Data Serialisation – Avro vs Protocol Buffers

Background File Formats Evolution Why not use CSV/XML/JSON?  Repeated or no meta information. Files are not splittable, so cannot be used in a map-reduce environment. Missing/ Limited schema definition and evolution support. Can leverage “JsonSchema” to maintain schema separately for JSON. It may still require transformation based on a schema, so why not consider Avro/Proto? … Read more

Impala – Optimise query when using to_utc_timestamp() function

From 40 minutes to just 4 minutes Impala to_utc_timestamp() function is used to convert date/timestamp timezone to UTC. But it works very slow. If you have less data in table even then you can easily notice its slow performance.  I faced a similar issue and noticed it was taking around 40 minutes alone to complete … Read more

Snowflake – Performance Tuning and Best Practices

Note: This article is a compilation effort of multiple performance tuning methodologies in Snowflake. Some Text/Images in the following article has been referred from various interesting articles and book, details of which are captured under “References”. Introduction to Snowflake Snowflake is a SaaS-based Data Warehouse platform built over AWS (and other clouds) infrastructure. One of the … Read more

Count(*) – Explaining different behaviour in Joins

Observations :  Count(1) or Count(*) – This is never expanded on each column individually so will work perfectly fine on complete data.  Count(1) is more optimized then Count(*) Count(source.*) – source represents “Left table” of “Left Outer Join”: This will be evaluated as Count(source.col1, source.col2, …. source.colN ) So, if any column has NULL, then the complete row … Read more

Cost and Performance Analysis : CSV and Parquet Format

I was doing some cost comparison of using CSV files vs Parquet File. Interestingly, when using Parquet format, data scanning for similar queries, cost 99% less as compared to CSV format. Queries ( Mentioned only for Parquet) CSV ( 11.32 GB )Run Time (in sec) CSV ( 11.32 GB )DataScanned (in GB) PARQUET ( 4.1 GB )Run Time (in sec) PARQUET ( 4.1 GB )DataScanned (in GB) … Read more

Connect to RedShift database from DBeaver

DBeaver is a SQL Client which is helpful in querying data and its analysis. It is loaded with many generic JDBC drivers for connecting to popular databases. For others, we can add driver separately and establish a connection.  Following are the steps to connect to Redshift data from DBeaver :  Step 1 : Create new … Read more

Sales Analysis – 3

Tables – product and sales are provided. Write SQL query to identify products which were sold only in first quarter of year 2019. If a product is sold in any other quarter then that is not supposed to be present in final output. Output should be product_id and product_name. Input Table: sales seller_id product_id buyer_id … Read more

SQL Contest – 14 days hackathon

Started on 1st July, 2019 – 2 weeks long hackathon was conducted, which concluded on 14th July, 2019. Data was collected for each participant and you are expected to write a query to print four columns as below : Contest Date Total number of unique participants who made at least 1 submission every day ( … Read more

Post Views – 2

Post views are being analysed using post_views table. This table contains information of all views on each posts date wise.Task is to find all viewers (viewer_id) who viewed more than one post on same date. Sort the final output in ascending order.   Input Table: post_views post_id author_id viewer_id view_date 22 2 4 2017-01-10 44 … Read more