Note: This article is a compilation of multiple performance tuning methodologies in Snowflake. Some text and images in the following article have been referenced from various interesting articles and books, details of which are captured under “References”.
Introduction to Snowflake
Snowflake is a SaaS-based Data Warehouse platform built on AWS (and other cloud) infrastructure. One of the features behind its popularity with businesses worldwide is its scalability, which also makes it cost-effective. The architecture involves virtual compute instances and efficient storage buckets that run solely on the cloud.
Snowflake provides security and protection of data using Amazon S3 policy controls, SSO, Azure SAS tokens, and Google Cloud Storage access permissions. As mentioned, Snowflake is known for scalability: you can also scale your storage depending on your needs. Thus, you can enjoy scalability and data security, among many other benefits, while using a Snowflake Data Warehouse to store your data.
It provides a single platform for data warehousing, data lakes, data engineering, data science, data application development, and secure sharing and consumption of real-time / shared data. Snowflake offers out-of-the-box capabilities like separation of storage and compute, on-the-fly scalable compute, data sharing, data cloning, and third-party tools support in order to handle the demanding needs of growing enterprises.
All data in Snowflake tables is automatically divided into micro-partitions, which are contiguous units of storage. Each micro-partition contains between 50 MB and 500 MB of uncompressed data (note that the actual size in Snowflake is smaller because data is always stored compressed).
Micro-partitioning is automatically performed on all Snowflake tables. Tables are transparently partitioned using the ordering of the data as it is inserted/loaded.
Snowflake’s unique architecture consists of three key layers:
- Database Storage
- When data is loaded into Snowflake, Snowflake reorganizes that data into its internal optimized, compressed, columnar format. Snowflake stores this optimized data in cloud storage.
Snowflake manages all aspects of how this data is stored — the organization, file size, structure, compression, metadata, statistics, and other aspects of data storage are handled by Snowflake. The data objects stored by Snowflake are not directly visible nor accessible by customers; they are only accessible through SQL query operations run using Snowflake.
- Query Processing
- Query execution is performed in the processing layer. Snowflake processes queries using “virtual warehouses”. Each virtual warehouse is an MPP compute cluster composed of multiple compute nodes allocated by Snowflake from a cloud provider.
Each virtual warehouse is an independent compute cluster that does not share compute resources with other virtual warehouses. As a result, each virtual warehouse has no impact on the performance of other virtual warehouses.
- Cloud Services
- The cloud services layer is a collection of services that coordinate activities across Snowflake. These services tie together all of the different components of Snowflake in order to process user requests, from login to query dispatch. The cloud services layer also runs on compute instances provisioned by Snowflake from the cloud provider.
- Services managed in this layer include:
- Authentication
- Infrastructure management
- Metadata management
- Query parsing and optimization
- Access control
Snowflake is provided as a Software-as-a-Service solution, with almost zero management or tuning options. This raises the question: how do you tune the Snowflake database when there are no indexes and few options available to tune the platform? The fact is, Snowflake was designed for simplicity, with almost no performance tuning options. This article summarises the best practices and tuning options to maximise query performance.
Tuning
Performance issues on any analytics platform generally fall into one of three categories, and the sections below discuss tuning options in each of them:
Data Loading
Avoid Scanning External Files
Snowflake supports querying data in two ways: data managed internally by Snowflake and data kept on external storage (external tables). Data kept on external storage will have slower performance compared to data managed by Snowflake internally, so if you can move data into Snowflake, consider the following supported approaches:
| | External Table | Managed Table | | | |
| --- | --- | --- | --- | --- | --- |
| Refresh Data Using? | S3 → SNS | INSERT INTO managed table SELECT FROM external table | COPY command | Snowflake Connector | S3 → Snowpipe |
| Manual Refresh / Scheduling Required? | Automatic | Yes | Yes | Yes | No |
| Bulk Load or Continuous Load? | Both | Bulk | Bulk | Continuous | Continuous |
| Data Availability Latency? | Minimum | High | High, based on data volume to be moved | High, based on data volume to be moved | Low, based on data volume to be moved |
| Refresh Table Metadata? | Whenever there is an update in the Schema Registry, an event can trigger a Lambda (or similar) service to update the DDL of the external/managed table (applies to all approaches) | | | | |
| Troubleshooting? | Easy | Easy | Complex | Complex | Complex (Example: 1) |
| Partition Evolution | Yes – use the Iceberg table format (still in preview, expected to release in summer 2022) | – | – | – | – |
| Protobuf Supported? | No | No | No | Link: 1 | No |
| Cost | Negligible (for using SNS notifications) | For every file copied into Snowflake (all managed-table approaches) | | | |
| Performance | Slow – workaround links: 1 (materialised view) | Fast compared to an External Table (all managed-table approaches) | | | |
- Snowpipe – Enables loading data from files as soon as they’re available in a stage. This means you can load data from files in micro-batches, making it available to users within minutes, rather than manually executing COPY statements on a schedule to load larger batches. Snowpipe therefore helps quickly populate a Managed Table and should not be confused with an External Table (a minimal example follows this list).
- The following diagram differentiates the usage of an External Table and Snowpipe (considering LZ as the Landing Zone layer in Snowflake):
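As a minimal sketch of the Snowpipe pattern (the pipe, table, and stage names below are illustrative, and the external stage is assumed to already exist), an auto-ingest pipe copies files into a managed table as soon as they land in the stage:
-- Illustrative names: raw_sales is a managed table, @sales_stage is an existing external stage.
CREATE OR REPLACE PIPE sales_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw_sales
  FROM @sales_stage
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);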
Data Design/Transformation
Clustering Keys
For very large tables, typically over a terabyte in size, designers should consider defining a cluster key to maximize query performance. Using a cluster key maximizes partition elimination, and therefore improves query performance.
To illustrate the effect of Snowflake Clustering, consider the diagram below in which data is frequently filtered in the query WHERE clause by DATE.
As data is loaded by date, it tends to be naturally clustered, with all data for the same day falling into the same micro-partition. If the following SQL is executed, Snowflake will attempt to keep all rows with the same sale date in the same micro-partition. When needed, a background task will automatically re-cluster the data, and this compute processing will be charged as a separate item.
ALTER TABLE sales CLUSTER BY (sales_date);
By using a cluster key, Snowflake can skip over a lot of data, but without any of the performance or data-management overheads associated with maintaining traditional indexes.
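To judge whether a cluster key is needed, or whether an existing one is still effective, Snowflake exposes clustering metadata. A quick check (table and key names are illustrative) is:
-- Returns a JSON summary of how well micro-partitions are clustered on the given key.
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(sales_date)');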
Creating one or more materialised views (clustered or unclustered)
A materialized view is a pre-computed data set derived from a query specification (the SELECT in the view definition) and stored for later use. Because the data is pre-computed, querying a materialized view is faster than executing a query against the base table of the view. This performance difference can be significant when a query is run frequently or is sufficiently complex. As a result, materialized views can speed up expensive aggregation, projection, and selection operations, especially those that run frequently and that run on large data sets.
create materialized view v1 as select * from table1 where column_1 between 100 and 400;
Deciding When to Create a Materialized View or a Regular View
In general, when deciding whether to create a materialized view or a regular view, use the following criteria:
- Create a materialized view when all of the following are true:
- The query results from the view don’t change often. This almost always means that the underlying/base table for the view doesn’t change often, or at least that the subset of base table rows used in the materialized view doesn’t change often.
- The results of the view are used often (typically significantly more often than the query results change).
- The query consumes a lot of resources. Typically, this means that the query consumes a lot of processing time or credits, but it could also mean that the query consumes a lot of storage space for intermediate results.
- Create a regular view when any of the following are true:
- The results of the view change often.
- The results are not used often (relative to the rate at which the results change).
- The query is not resource intensive so it is not costly to re-run it.
Materialized Views and Clustering
Defining a clustering key on a materialized view is supported and can increase performance in many situations. However, it also adds costs.
If you cluster both the materialized view(s) and the base table on which the materialized view(s) are defined, you can cluster the materialized view(s) on different columns from the columns used to cluster the base table.
In most cases, clustering a subset of the materialized views on a table tends to be more cost-effective than clustering the table itself. If the data in the base table is accessed (almost) exclusively through the materialized views, and (almost) never directly through the base table, then clustering the base table adds costs without adding benefit.
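A short sketch of this pattern (view, table, and column names are illustrative): the base table is left in its natural load order, while the materialized view is clustered on the column most often used for filtering.
CREATE MATERIALIZED VIEW daily_sales_by_region
  CLUSTER BY (region_id)
AS
  SELECT region_id, sales_date, SUM(sales_gbp) AS total_sales
  FROM sales
  GROUP BY region_id, sales_date;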
Maintenance Costs for Materialized Views
Materialized views impact your costs for both storage and compute resources:
Storage: Each materialized view stores query results, which adds to the monthly storage usage for your account.
Compute resources: In order to prevent materialized views from becoming out-of-date, Snowflake performs automatic background maintenance of materialized views. When a base table changes, all materialized views defined on the table are updated by a background service that uses compute resources provided by Snowflake.
These updates can consume significant resources, resulting in increased credit usage. However, Snowflake ensures efficient credit usage by billing your account only for the actual resources used. Billing is calculated in 1-second increments.
Search Optimisation Service
The search optimization service aims to significantly improve the performance of selective point lookup queries on tables. A point lookup query returns only one or a small number of distinct rows. Use case examples include:
- Business users who need fast response times for critical dashboards with highly selective filters.
- Data scientists who are exploring large data volumes and looking for specific subsets of data.
A user can register one or more tables to the search optimization service. Search optimization is a table-level property and applies to all columns with supported data types.
-- Add
ALTER TABLE [IF EXISTS] <table_name> ADD SEARCH OPTIMIZATION;

-- Remove
ALTER TABLE [IF EXISTS] <table_name> DROP SEARCH OPTIMIZATION;
How Does the Search Optimization Service Work?
A maintenance service that runs in the background is responsible for creating and maintaining the search access path:
- When you add search optimization to a table, the maintenance service creates and populates the search access path with the data needed to perform the lookups.
- The process of populating data can take time, depending on the size of the table. The service does this work in the background and does not block any concurrent operations on the table.
- When data in the table is updated (for example, by loading new data sets or through DML operations), the maintenance service automatically updates the search access path to reflect the changes to the data.
- If queries are run when the search access path hasn’t been updated yet, the queries might run slower but will always return up-to-date results.
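Build progress can be checked from table metadata; for example (table name is illustrative), the SHOW TABLES output includes SEARCH_OPTIMIZATION and SEARCH_OPTIMIZATION_PROGRESS columns, showing whether the feature is enabled and how far the background population has progressed:
SHOW TABLES LIKE 'sales';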
However, note that there is a cost for the storage and compute resources for this service.
Semi-Structured Data
When storing semi-structured data that is mostly regular and uses only native types such as strings and integers, it is fine to load it into a VARIANT column, since the storage and query requirements are nearly the same as for flattened relational columns.
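For instance (table and field names are illustrative), native-type values inside a VARIANT column can be queried directly with path notation and a cast:
-- payload is a VARIANT column holding JSON documents.
CREATE TABLE raw_events (payload VARIANT);

SELECT payload:user_id::NUMBER    AS user_id,
       payload:event_type::STRING AS event_type
FROM raw_events;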
Date/Time Data Types
When defining columns for date/time values, it is best to choose a DATE, TIME, or TIMESTAMP data type instead of a character type, because Snowflake stores the former data types more efficiently than the latter.
Transient Tables
Snowflake supports “transient tables”, which have no Fail-safe period (and limited Time Travel history). This saves storage costs, so use transient tables wherever the data does not need that level of protection.
Snowflake supports the following table types (a short example of creating transient and temporary tables follows this list):
- Permanent – Default
- Temporary – Session Only
- Transient – Beyond Session with no fail-safe period
- External – Datalake
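As a brief illustration (table and column names are examples), a transient table avoids Fail-safe storage costs, while a temporary table exists only for the current session:
-- Transient: persists beyond the session, but with no Fail-safe period.
CREATE TRANSIENT TABLE staging_sales (
    sale_id     NUMBER,
    sales_date  DATE,
    sales_gbp   NUMBER(12,2)
);

-- Temporary: dropped automatically when the session ends.
CREATE TEMPORARY TABLE session_scratch AS
SELECT * FROM staging_sales WHERE sales_date = CURRENT_DATE();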
Data Querying
The following tuning practices are generally followed by database administrators.
Scale Up
While you should never scale up to tune a specific query, it may be sensible to resize the warehouse to improve overall query performance. As scaling up adds additional servers, it spreads the workload and effectively increases the overall warehouse cache size.
While caching is automatic behaviour, there are several best practices you can implement to maximise cache usage and overall query performance.
- Utilise the same virtual warehouse for similar queries: When segmenting query workloads, place users querying the same data on the same virtual warehouse. This maximises the chances that data retrieved to the cache by one user will also be used by others.
- Auto-Suspend: You should also avoid being too hasty in suspending a virtual warehouse when it’s not in use. By default, a warehouse will automatically suspend after 10 minutes and auto-resume when a SQL statement is executed. You could set the auto-suspend to as little as a few seconds to save money, but be aware that, when resumed, the virtual warehouse cache may be clean, which means you lose the performance benefits of caching.
- Scale up for large data volumes: If you have a sequence of large queries to perform against massive (multi-terabyte) data volumes, you can improve workload performance by scaling up. Simply execute a SQL statement to increase the virtual warehouse size, and new queries will start on the larger (faster) cluster. While this will start with a clean (empty) cache, you should normally find performance doubles at each size, and this extra performance boost will more than outweigh the cost of refreshing the cache. Be aware, however, that if you scale up (or down), the data cache is cleared.
- Scale out for Concurrency: As the workload increases, jobs begin to queue as there are insufficient resources available. However, the Snowflake multi-cluster feature can be configured to automatically create another same-size virtual warehouse, and this continues to take up the load. As tasks complete, the above solution automatically scales back down to a single cluster, and once the last task finishes, the last running cluster will suspend. This is by far the most efficient method of completing batch parallel tasks, and we still have the option of scaling up.
- Scale down – but not too soon: Once your large task has completed, you can reduce costs by scaling down or even suspending the virtual warehouse. Be aware again, however, that the cache will start clean on the smaller cluster. By all means tune the warehouse size dynamically, but don’t keep adjusting it, or you’ll lose the benefit.
The SQL snippet below illustrates the command needed to create a multi-cluster warehouse, which will automatically suspend after 60 seconds of idle time, but use the ECONOMY scaling policy to favour throughput and credit savings over individual query latency.
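-- A sketch only: the warehouse name, size, and cluster counts are illustrative.
CREATE WAREHOUSE IF NOT EXISTS batch_vwh WITH
    WAREHOUSE_SIZE    = 'MEDIUM'
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 4
    SCALING_POLICY    = 'ECONOMY'
    AUTO_SUSPEND      = 60
    AUTO_RESUME       = TRUE;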
Cache Usage in Snowflake
Snowflake holds both a data cache on SSD and a result cache to maximise SQL query performance.
- Result Cache: Holds the results of every query executed in the past 24 hours. These are available across virtual warehouses, so query results returned to one user are available to any other user on the system who executes the same query, provided the underlying data has not changed.
- Local Disk Cache: Used to cache data used by SQL queries. Whenever data is needed for a given query, it is retrieved from remote disk storage and cached in SSD and memory.
- Remote Disk: Holds the long-term storage. This layer is responsible for data resilience, which in the case of Amazon Web Services means 99.999999999% durability, even in the event of an entire data centre failure.
Benchmarking can be done using the following strategy:
- Run from cold: Start a new virtual warehouse (with no local disk caching) and execute the query.
- Run from warm: Disable result caching and repeat the query. This makes use of the local disk cache, but not the result cache.
- Run from hot: Repeat the query again, but with result caching switched on.
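For the warm runs, result caching can be disabled at session level (USE_CACHED_RESULT is a standard session parameter), which forces the query to actually execute rather than return a cached result:
-- Disable the result cache for benchmarking, then re-enable it afterwards.
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
-- ... run the benchmark query here ...
ALTER SESSION SET USE_CACHED_RESULT = TRUE;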
Separate query workloads
The single most important method to maximise throughput and minimise latency on Snowflake is to segment query workloads. The diagram below illustrates what should be a common design pattern of every Snowflake deployment: separation of workloads.
There is often a temptation to separate workloads by department or team, for example by giving each team its own virtual warehouse to help track usage. However, it is typically best practice to separate workloads by the type of workload rather than by user group. This means, for example, running business intelligence queries from Marketing users on one virtual warehouse, while supporting ultra-fast Finance dashboard queries on a separate one.
Spilling
Any value in SPILL_TO_LOCAL or SPILL_TO_REMOTE indicates a potentially large sort operation on a small virtual warehouse. Consider moving the query to a bigger warehouse or scaling up the existing warehouse if appropriate.
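Spilled queries can also be spotted retrospectively; a rough check against the ACCOUNT_USAGE query history (the seven-day window is just an example) is:
SELECT query_id,
       warehouse_name,
       bytes_spilled_to_local_storage,
       bytes_spilled_to_remote_storage
FROM snowflake.account_usage.query_history
WHERE start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
  AND (bytes_spilled_to_local_storage > 0
       OR bytes_spilled_to_remote_storage > 0)
ORDER BY bytes_spilled_to_remote_storage DESC;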
Select Only Required Columns
Avoid selecting all the columns from a table or view using SELECT *. While it’s fine for ad-hoc queries, you’ll find it much faster to list the specific columns you need. The reason is simple: Snowflake uses a columnar data store, so reading fewer columns means scanning less data.
Table Scan
A high value of PCT_TABLE_SCAN and a large number of MB_SCANNED indicate potentially poor query selectivity on large tables. Check the query WHERE clause and consider using a cluster key if appropriate.
Don’t mess up your SQL
If the SQL itself is poorly written, none of the tuning techniques discussed above will come to the rescue. Understanding SQL internals and how queries execute is essential for tuning applications. In the example below, Query-2 is an optimised version of Query-1, with correct usage of a window function.
-- Query-1
-- Requires a lot of memory (holds all the distinct rows in memory; you won't know
-- you're done until the final row of the full dataset has been processed),
-- a lot of compute (hash every row, look up the hash to remove duplicates),
-- and is very hard to parallelise (either one node does the dedup, or you need to
-- shuffle a lot of data to distribute the effort, depending on the implementation).
SELECT DISTINCT
       username
     , account_id
     , FIRST_VALUE(session_id)          OVER (PARTITION BY account_id ORDER BY login_ts) AS session_id
     , FIRST_VALUE(ip_address)          OVER (PARTITION BY account_id ORDER BY login_ts) AS ip_address
     , FIRST_VALUE(country_id)          OVER (PARTITION BY account_id ORDER BY login_ts) AS country_id
     , FIRST_VALUE(website_entry_point) OVER (PARTITION BY account_id ORDER BY login_ts) AS website_entry_point
     , FIRST_VALUE(payment_platform)    OVER (PARTITION BY account_id ORDER BY login_ts) AS payment_platform
     , FIRST_VALUE(platform_type)       OVER (PARTITION BY account_id ORDER BY login_ts) AS platform_type
     , FIRST_VALUE(session_duration_s)  OVER (PARTITION BY account_id ORDER BY login_ts) AS session_duration_s
     , FIRST_VALUE(number_sold_items)   OVER (PARTITION BY account_id ORDER BY login_ts) AS number_sold_items
     , FIRST_VALUE(sales_gbp)           OVER (PARTITION BY account_id ORDER BY login_ts) AS sales_gbp
FROM SALES.SALES_SESSION_AGGREGATION
WHERE is_meaningful_visit = TRUE;

-- Query-2
-- Requires less memory and compute, and is easy to parallelise
-- (each "partition by" combination can be processed by a different node).
SELECT username
     , account_id
     , session_id
     , ip_address
     , country_id
     , website_entry_point
     , payment_platform
     , platform_type
     , session_duration_s
     , number_sold_items
     , sales_gbp
FROM SALES.SALES_SESSION_AGGREGATION
WHERE is_meaningful_visit = TRUE
QUALIFY ROW_NUMBER() OVER (PARTITION BY account_id ORDER BY login_ts) = 1;
Conclusion
Snowflake Data Warehouse is a secure, scalable, and popular cloud data warehousing solution. With the tips laid out above, you now stand a better chance of getting the most out of Snowflake. A common mistake people make with Snowflake is assuming that the solution to every problem is scaling up to a bigger warehouse. In reality, the solutions are multifaceted and depend on various factors such as end-user queries and transformations. This does not mean that scaling up is not a good strategy; it is suitable when you intend to improve query performance. However, before settling on scaling up, it is best to find the issue that is actually affecting the performance of your Data Warehouse.
Hence it is essential to be aware of Snowflake performance tuning tactics that you can employ to boost efficiency. This article introduced you to Snowflake and its tuning techniques.
References
- https://docs.snowflake.com/en/user-guide/intro-key-concepts.html#database-storage
- https://docs.snowflake.com/en/user-guide/views-materialized.html#materialized-views-and-clustering
- https://docs.snowflake.com/en/user-guide/search-optimization-service.html#
- https://community.snowflake.com/s/article/Tuning-Snowflake
- https://hevodata.com/learn/snowflake-performance-tuning/#5
- https://www.analytics.today/blog/top-3-snowflake-performance-tuning-tactics
- https://www.analytics.today/blog/caching-in-snowflake-data-warehouse
- https://jmarquesdatabeyond.medium.com/sql-like-a-pro-query-tuning-1-ce6f6b710560
- https://github.com/lynnlangit/learn-snowflakedb
- https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html
- https://www.youtube.com/playlist?list=PLba2xJ7yxHB7SWc4Sm-Sp3uGN74ulI4pS