Essential Snowflake Optimization Strategies
- Niall Woodward
Snowflake's huge popularity is driven by its ability to process large volumes of data at extremely low latency with minimal configuration. As a result, it is an established favorite of data teams across thousands of organizations. In this guide, we share optimization techniques to maximize the performance and efficiency of Snowflake. Follow these best practices to make queries run faster while reducing costs.
The optimization techniques in this post fall into three categories:
- Improve data read efficiency
- Improve data processing efficiency
- Optimize warehouse configuration
The techniques assume a good understanding of Snowflake's architecture. Check out our Snowflake architecture overview for a quick refresher if needed.
Before diving into optimizations, let's first remind ourselves how to identify what's slowing a query down.
Finding query bottlenecks
Use Snowflake's Query Profile and look at the 'Most Expensive Nodes' section.
If a query's most expensive node is a Sort, focus on improving data processing efficiency and possibly increase the warehouse size. If the most expensive node or nodes are TableScans, efforts will be best spent optimizing the query's data read efficiency. If you've come to this post working on a specific query, feel free to skip to the relevant section.
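Beyond the Query Profile UI, operator-level statistics can also be retrieved programmatically. A minimal sketch using Snowflake's GET_QUERY_OPERATOR_STATS table function (the query ID is a placeholder to replace with one from your query history):

```sql
-- Per-operator statistics for a completed query; the slowest operators
-- show up with the largest execution_time_breakdown values
SELECT
    operator_id,
    operator_type,
    execution_time_breakdown,
    operator_statistics
FROM TABLE(GET_QUERY_OPERATOR_STATS('01a2b3c4-0000-0000-0000-000000000000'))
ORDER BY operator_id;
```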
Performance Optimization Strategies
1. Improve Data Read Efficiency
Queries can sometimes spend considerable time reading data from table storage. This step of the query is shown as a TableScan in the query profile. A TableScan involves downloading data over the network from the table's storage location into the virtual warehouse's worker nodes. This process can be sped up by reducing the volume of data downloaded, or increasing the virtual warehouse's size.
Snowflake only reads the columns which are selected in a query, and the micro-partitions relevant to a query's filters - provided that the table's micro-partitions are well-clustered on the filter condition. To learn more about micro-partitions, read our previous post on Snowflake micro-partitions.
The three techniques to reduce the data downloaded by a query and therefore speed up TableScans are:
- Reduce the number of columns accessed
- Leverage pruning to reduce the number of micro-partitions scanned by a query
- Use pre-aggregated tables
Reduce the number of columns accessed
This is a simple one, but where possible it makes a big difference. Query requirements change over time, and columns that were once useful may no longer be needed by downstream processes. Snowflake stores table data in a hybrid columnar file format called micro-partitions, which enables it to read only the columns a query selects. The process of downloading micro-partition data is called scanning, and reducing the number of columns accessed means less data is transferred over the network.
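As an illustration with a hypothetical orders table, selecting only the columns you need keeps the scan small:

```sql
-- Scans the micro-partition data for every column in the table
SELECT * FROM orders;

-- Scans only the micro-partition data for the two selected columns
SELECT order_id, amount FROM orders;
```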
Leverage pruning to reduce the number of micro-partitions scanned by a query
To determine whether pruning performance can be improved, take a look at the query profile's 'Partitions scanned' and 'Partitions total' statistics.
If a query has no where clause, adding one may speed up the TableScan significantly (and downstream nodes too, as they process less data). If the query already filters with a where clause but 'Partitions scanned' is close to 'Partitions total', the where clause is not pruning effectively.
Improve pruning by:
- Ensuring where clauses are placed as early in queries as possible, otherwise they may not be 'pushed down' onto the TableScan step (this also speeds up later steps in the query)
- Adding well-clustered columns to join and merge conditions, which can be pushed down as JoinFilters to enable pruning
- Making sure that columns used in where filters of a query align with the table's clustering (learn more about clustering here)
- Avoiding the use of functions in where conditions - these often prevent Snowflake from pruning micro-partitions
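A sketch of the last two points, assuming a hypothetical orders table clustered on order_date:

```sql
-- Prunes well: the filter is applied directly to the clustering column
SELECT order_id, amount
FROM orders
WHERE order_date >= '2023-01-01';

-- Prunes poorly: wrapping the column in a function hides it from pruning
SELECT order_id, amount
FROM orders
WHERE TO_CHAR(order_date, 'YYYY-MM') >= '2023-01';
```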
Use pre-aggregated tables
Create 'rollup' or 'derived' tables that contain fewer rows of data. Pre-aggregated tables can often be designed to provide the information most queries require while using less storage space. This makes them much faster to query. For retail businesses, a common strategy is to use a daily orders rollup table for financial and stock reporting, with the raw orders table only queried where per-order granularity is needed.
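The daily orders rollup described above might be built like this (table and column names are illustrative):

```sql
-- One row per day instead of one row per order
CREATE OR REPLACE TABLE daily_orders AS
SELECT
    order_date,
    COUNT(*)    AS order_count,
    SUM(amount) AS total_amount
FROM orders
GROUP BY order_date;
```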
2. Improve Data Processing Efficiency
Operations like Joins, Sorts, and Aggregates occur downstream of TableScans and can often become the bottleneck in queries. Strategies to optimize data processing include reducing the number of query steps, incrementally processing data, and using your knowledge of the data to improve performance.
- Simplify and reduce the number of query operations
- Reduce the volume of data processed
- Use your knowledge of the data to help Snowflake process it efficiently
Simplify and reduce the number of query operations
Each operation in a query takes time to move data around between worker threads. Consolidating and removing unnecessary operations reduces the amount of network transfer required to execute a query. It also helps Snowflake reuse computations and save additional work. Most of the time, CTEs and subqueries do not impact performance, so use them to help with readability.
Reduce the volume of data processed
The less data there is, the faster each data processing step completes. Reducing both the number of columns and rows processed by each step in a query will improve performance.
For transformation queries that write to another table, a powerful way of reducing the volume of data processed is incrementalization. For the example of the orders table, we could configure the query to only process new or updated orders, and merge those results into the existing table.
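A sketch of that incremental pattern, with hypothetical table and column names: only rows updated since the last run are selected, then merged into the derived table.

```sql
-- Process only new or updated orders, then merge into the existing table
MERGE INTO orders_enriched AS tgt
USING (
    SELECT order_id, customer_id, amount, updated_at
    FROM orders
    WHERE updated_at > (SELECT COALESCE(MAX(updated_at), '1970-01-01') FROM orders_enriched)
) AS src
ON tgt.order_id = src.order_id
WHEN MATCHED THEN UPDATE SET
    customer_id = src.customer_id,
    amount      = src.amount,
    updated_at  = src.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, customer_id, amount, updated_at)
    VALUES (src.order_id, src.customer_id, src.amount, src.updated_at);
```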
Use your knowledge of the data to help Snowflake process it efficiently
Your own knowledge of the data can be used to improve query performance. For example, if a query groups on many columns and you know that some of the columns are redundant because the others already represent the same or a higher granularity, it may be faster to remove those columns from the group by and join them back in a separate step.
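For example, if customer_name is fully determined by customer_id (hypothetical tables), grouping by the key alone and joining the name back afterwards can be faster:

```sql
-- Group by the key only, then attach the functionally dependent column
SELECT g.customer_id, c.customer_name, g.total_amount
FROM (
    SELECT customer_id, SUM(amount) AS total_amount
    FROM orders
    GROUP BY customer_id
) g
JOIN customers c ON c.customer_id = g.customer_id;
```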
If a grouped or joined column is heavily skewed (meaning a small number of distinct values occur most frequently), this can have a detrimental impact on Snowflake's speed. A common example is grouping by a column that contains a significant number of null values. Filtering rows with these values out and processing them in a separate operation can result in faster query speeds.
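One way to sketch the null-handling idea, again with an illustrative orders table:

```sql
-- Aggregate the well-distributed keys
SELECT customer_id, SUM(amount) AS total_amount
FROM orders
WHERE customer_id IS NOT NULL
GROUP BY customer_id

UNION ALL

-- Handle the heavily skewed null group in a separate, cheap operation
SELECT NULL AS customer_id, SUM(amount) AS total_amount
FROM orders
WHERE customer_id IS NULL;
```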
Finally, range joins can be slow in all data warehouses including Snowflake. Your knowledge of the interval lengths in the data can be used to reduce the join explosion that occurs. Check out our recent post on this if you're seeing slow range join performance.
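One common mitigation is to bucket the intervals so most of the work happens in an equi-join. A sketch with hypothetical events and sessions tables, under the assumption that no session lasts longer than one hour (so each session overlaps at most two hourly buckets):

```sql
-- Explode each session into the hourly buckets it overlaps (at most two,
-- given the assumed maximum session length), then equi-join on the bucket
-- before applying the exact range condition
WITH session_buckets AS (
    SELECT s.*,
           DATEADD('hour', g.n, DATE_TRUNC('hour', s.session_start)) AS bucket
    FROM sessions s,
         (SELECT 0 AS n UNION ALL SELECT 1) g
    WHERE DATEADD('hour', g.n, DATE_TRUNC('hour', s.session_start)) <= s.session_end
)
SELECT e.event_id, b.session_id
FROM events e
JOIN session_buckets b
  ON DATE_TRUNC('hour', e.event_ts) = b.bucket
 AND e.event_ts BETWEEN b.session_start AND b.session_end;
```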
3. Optimize Virtual Warehouse Configuration
Snowflake's virtual warehouses can be easily configured to support larger and higher-concurrency workloads. The key configurations which impact performance are:
- Warehouse size
- Auto-suspend
- Clusters (Enterprise edition and above)
- Scaling Policy (Enterprise edition and above)
Warehouse size
Warehouse size determines the total computational power available to queries running on the warehouse; increasing it is also known as vertical scaling.
Increase virtual warehouse size when:
- Queries are spilling to remote disk (identifiable via the query profile)
- Query results are needed faster (typically for user-facing applications)
Queries that spill to remote disk run inefficiently due to the large volumes of network traffic between the warehouse executing the query and the remote disk that stores the data used in executing the query. Increasing the warehouse size doubles both the available RAM and local disk, which are significantly faster to access than remote disk. Where remote disk spillage occurs, increasing the warehouse size can more than double a query's speed. We've gone into more detail on Snowflake warehouse sizing in the past, and covered how to configure warehouse sizes in dbt too.
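Resizing is a single statement and takes effect for newly submitted queries (the warehouse name here is illustrative):

```sql
-- Each size increment doubles compute, RAM, and local disk
ALTER WAREHOUSE transforming SET WAREHOUSE_SIZE = 'LARGE';
```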
Auto-suspend
Auto-suspend is a virtual warehouse feature that automatically suspends a warehouse once it has stopped processing queries for a period determined by the AUTO_SUSPEND warehouse parameter. The default for warehouses created in the Snowsight UI is 10 minutes.
We recommend setting the auto-suspend parameter to 1 minute, which is the minimum billing time for a warehouse. Reasons to increase this value include the latency of the first query on a suspended warehouse while it resumes, and the fact that the warehouse cache may be lost when a warehouse suspends.
On the first, we've rarely seen warehouse resuming take longer than a couple of seconds, which is acceptable on occasion for even user-facing applications. If this isn't acceptable, however, a longer auto-suspend threshold may be required.
The warehouse cache is data stored on the warehouse's local disk. When a warehouse suspends, Snowflake doesn't guarantee that the cache will persist through a resume. The impact of cache loss is that queries must re-scan data from table storage rather than reading it from the much faster local cache. This warehouse cache is separate from the global result cache, which can return results for identical queries executed within 24 hours, provided the data in the queried tables is unchanged. If warehouse cache loss is impacting queries, increasing the auto-suspend threshold will help.
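The one-minute recommendation above can be applied like so (the warehouse name is illustrative):

```sql
-- 60 seconds is the minimum billing period for a warehouse
ALTER WAREHOUSE reporting SET AUTO_SUSPEND = 60;
```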
Clusters (Enterprise edition and above)
Clusters can be used to create more instances of the same-size warehouse.
If there are periods when warehouse queuing causes queries to miss their required processing speeds, consider using multi-clustering or increasing the warehouse's maximum cluster count. This allows the warehouse to match query volumes by adding or removing clusters.
Unlike the cluster count, the size of a virtual warehouse cannot be adjusted automatically by Snowflake in response to query volumes. This makes multi-cluster warehouses more cost-effective for processing volatile query volumes, as each cluster is only billed while active.
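A sketch of a multi-cluster configuration (warehouse name and counts are illustrative):

```sql
-- Scale out to up to four clusters under load, back down to one when idle
ALTER WAREHOUSE reporting SET
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 4;
```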
Scaling Policy (Enterprise edition and above)
Snowflake offers two scaling policies - Standard and Economy. For all virtual warehouses which serve user-facing queries, use the Standard scaling policy. If you're very cost-conscious, experiment with the Economy scaling policy for queueing-tolerant workloads such as data loading to see if it reduces cost while maintaining the required throughput. Otherwise, we recommend using Standard for all warehouses.
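Switching a queueing-tolerant warehouse to Economy for the experiment described above is a one-line change (warehouse name illustrative):

```sql
-- Economy waits longer before adding clusters, trading latency for cost
ALTER WAREHOUSE loading SET SCALING_POLICY = 'ECONOMY';
```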
We've covered performance optimization strategies across data reading, processing, and warehouse configuration. We've already gone into some of these topics in more detail in the posts linked throughout, and will continue to do so in future posts. If you'd like to keep up to date and be notified as we release new content, subscribe to our newsletter below.