
Calculating cost per query in Snowflake

  • Ian Whitestone
    Co-founder & CEO of SELECT

For most Snowflake customers, compute costs (the charges for virtual warehouses) will make up the largest portion of the bill. To effectively reduce this spend, the queries driving the highest costs need to be accurately identified.

Snowflake customers are billed 1 for each second that virtual warehouses are running, with a minimum 60 second charge each time one is resumed. The Snowflake UI currently provides a breakdown of cost per virtual warehouse, but doesn't attribute spend at a more granular, per-query level. This post provides a detailed overview and comparison of different ways to attribute warehouse costs to queries, along with the code required to do so.

Skip to the final SQL?

If you want to skip ahead and see the SQL implementation for the recommended approach, you can head straight to the end!

Simple approach

We'll start with a simple approach which multiplies a query's execution time by the billing rate of the warehouse it ran on. For example, say a query ran for 10 minutes on a Medium warehouse. A Medium warehouse costs 4 credits per hour, and with a cost of $3 per credit 2, we'd say this query costs $2 (10/60 hours * 4 credits/hour * $3/credit).

SQL Implementation

We can implement this in SQL by leveraging the snowflake.account_usage.query_history view which contains all queries from the last year along with key metadata like the total execution time and size of the warehouse the query ran on:

WITH
warehouse_sizes AS (
    SELECT 'X-Small' AS warehouse_size, 1 AS credits_per_hour UNION ALL
    SELECT 'Small' AS warehouse_size, 2 AS credits_per_hour UNION ALL
    SELECT 'Medium' AS warehouse_size, 4 AS credits_per_hour UNION ALL
    SELECT 'Large' AS warehouse_size, 8 AS credits_per_hour UNION ALL
    SELECT 'X-Large' AS warehouse_size, 16 AS credits_per_hour UNION ALL
    SELECT '2X-Large' AS warehouse_size, 32 AS credits_per_hour UNION ALL
    SELECT '3X-Large' AS warehouse_size, 64 AS credits_per_hour UNION ALL
    SELECT '4X-Large' AS warehouse_size, 128 AS credits_per_hour
)
SELECT
    qh.query_id,
    qh.query_text,
    -- cost in credits; multiply by your effective rate per credit to get dollars
    qh.execution_time/(1000*60*60)*wh.credits_per_hour AS query_cost
FROM snowflake.account_usage.query_history AS qh
INNER JOIN warehouse_sizes AS wh
    ON qh.warehouse_size=wh.warehouse_size
WHERE
    start_time >= CURRENT_DATE - 30

This gives us an estimated query cost for each query_id. To account for the same query being run multiple times in a period, we can aggregate by the query_text:

WITH
warehouse_sizes AS (
    -- same as above
),
queries AS (
    SELECT
        qh.query_id,
        qh.query_text,
        qh.execution_time/(1000*60*60)*wh.credits_per_hour AS query_cost
    FROM snowflake.account_usage.query_history AS qh
    INNER JOIN warehouse_sizes AS wh
        ON qh.warehouse_size=wh.warehouse_size
    WHERE
        start_time >= CURRENT_DATE - 30
)
SELECT
    query_text,
    SUM(query_cost) AS total_query_cost_last_30d
FROM queries
GROUP BY 1

Opportunities for improvement

While simple and easy to understand, the main pitfall with this approach is that Snowflake does not charge per second that a query ran. They charge per second that the warehouse is up. A given query may automatically resume the warehouse, run for 6 seconds, then leave the warehouse idle until it is automatically suspended. Snowflake bills for this idle time, so it can be helpful to "charge back" this cost to the query. Similarly, if two queries run concurrently on the warehouse for the same 20 minutes, Snowflake will bill for 20 minutes, not 40. Idle time and concurrency are therefore important considerations in cost attribution and optimization efforts.
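
To get a feel for how big this gap is in your own account, you can compare the credits the simple model would attribute with the credits actually billed per warehouse. The sketch below is not from the original approach and only assumes the standard account_usage views; any difference between the two columns comes largely from idle time (billed but unattributed) and concurrency (attributed more than once).

WITH billed AS (
    SELECT
        warehouse_name,
        SUM(credits_used_compute) AS billed_credits
    FROM snowflake.account_usage.warehouse_metering_history
    WHERE start_time >= CURRENT_DATE - 30
    GROUP BY 1
),
estimated AS (
    SELECT
        warehouse_name,
        -- same execution-time-based estimate as the simple approach above
        SUM(
            execution_time/(1000*60*60) * CASE warehouse_size
                WHEN 'X-Small' THEN 1 WHEN 'Small' THEN 2 WHEN 'Medium' THEN 4
                WHEN 'Large' THEN 8 WHEN 'X-Large' THEN 16 WHEN '2X-Large' THEN 32
                WHEN '3X-Large' THEN 64 WHEN '4X-Large' THEN 128
            END
        ) AS estimated_credits
    FROM snowflake.account_usage.query_history
    WHERE start_time >= CURRENT_DATE - 30
        AND warehouse_size IS NOT NULL
    GROUP BY 1
)
SELECT
    billed.warehouse_name,
    billed.billed_credits,
    estimated.estimated_credits
FROM billed
LEFT JOIN estimated
    ON billed.warehouse_name = estimated.warehouse_name
ORDER BY billed.billed_credits DESC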

When aggregating by query_text to get the total cost in the period, we grouped by the unprocessed query text. In practice, it is common for the systems that generated these queries to add unique metadata to each query. For example, Looker will add some context to each query. The first time a query is run, it may look like this:

SELECT
    id,
    created_at
FROM orders
-- Looker Query Context '{"user_id":181,"history_slug":"9dcf35a","instance_slug":"aab1f6"}'

And in the next run, this metadata will be different:

SELECT
    id,
    created_at
FROM orders
-- Looker Query Context '{"user_id":181,"history_slug":"1kal99e","instance_slug":"jju3q8"}'

Similar to Looker, dbt will add its own metadata, giving each query a unique invocation_id:

SELECT
    id,
    created_at
FROM orders
/*{
    "app": "dbt",
    "invocation_id": "52c47806ae6d",
    "node_id": "model.jaffle_shop.orders",
    ...
}*/

When grouped by query_text, the two occurrences of the query above won't be linked, since this metadata makes each one unique. This could result in a single, potentially easily addressed source of high-cost queries (for example, a dashboard) going unidentified.

We may wish to go even further and bucket costs at a higher level. dbt models often run multiple queries: for example, a CREATE TEMPORARY TABLE statement followed by a MERGE. A given dashboard may trigger 5 different queries each time it is refreshed. Being able to group the entire collection of queries from a single origin is very useful for attributing spend and then targeting improvements in a time-efficient manner.
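
As a rough illustration (not part of the approach in this post), the dbt metadata shown above can be pulled out of the query text with REGEXP_SUBSTR and TRY_PARSE_JSON, giving a dbt node to group on. The exact comment shape may vary with your dbt version and query comment configuration:

SELECT
    query_id,
    -- 's' lets '.' span newlines, 'e' returns the capture group (the comment body)
    TRY_PARSE_JSON(
        REGEXP_SUBSTR(query_text, '/\\*(.*)\\*/', 1, 1, 'se', 1)
    ):node_id::STRING AS dbt_node_id
FROM snowflake.account_usage.query_history
WHERE query_text ILIKE '%"app": "dbt"%'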

With these opportunities in mind, can we do better?

New approach

To be able to reconcile the total attributed query costs with the final bill, it's important to start with the exact charges for each warehouse. We work at an hourly granularity because snowflake.account_usage.warehouse_metering_history, the source of truth for warehouse charges, reports credit consumption at an hourly level. We can then calculate how many seconds each query spent executing in the hour, and allocate the credits proportionally to each query based on its fraction of the total execution time. In doing so, we account for idle time by distributing it among the queries that ran during the period. Concurrency is also handled, since more queries running will generally lower the average cost per query.

To ground this in an example, say the TRANSFORMING_WAREHOUSE consumed 100 credits in a single hour. During that time, three queries ran: two for 10 minutes each and one for 20 minutes, for 40 minutes of total execution time. In this scenario, we would allocate credits to each query in the following way (a quick SQL sanity check of this allocation follows the list):

  1. Query 1 (10 minutes) -> 25 credits
  2. Query 2 (20 minutes) -> 50 credits
  3. Query 3 (10 minutes) -> 25 credits
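
The toy query below reproduces this allocation; the numbers come from the example above, not real account data:

WITH example_queries AS (
    SELECT * FROM (VALUES (1, 10), (2, 20), (3, 10)) AS t (query_id, execution_minutes)
)
SELECT
    query_id,
    -- 100 credits spread proportionally over 40 minutes of total execution time
    100 * execution_minutes / SUM(execution_minutes) OVER () AS allocated_credits
FROM example_queries
ORDER BY query_id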

In the diagram below, query 3 begins between 17:00-18:00 and finishes after 18:00. To account for queries which span multiple hours, we only include the portion of the query that ran in each hour.

(Diagram: calculating cost per query in Snowflake)

When only one query runs in an hour, like Query 5 below, all credit consumption is attributed to that one query, including the credits consumed by the warehouse sitting idle.

(Diagram: calculating cost per query in Snowflake with idle time)

SQL Implementation

Some queries don't execute on a warehouse and are processed entirely by the cloud services layer. To filter those, we remove queries with warehouse_size IS NULL 3. We'll also calculate a new timestamp, execution_start_time, to denote the exact time at which the query began running on the warehouse 4.

SELECT
    query_id,
    query_text,
    warehouse_id,
    TIMEADD(
        'millisecond',
        queued_overload_time + compilation_time +
            queued_provisioning_time + queued_repair_time +
            list_external_files_time,
        start_time
    ) AS execution_start_time,
    end_time
FROM snowflake.account_usage.query_history AS q
WHERE TRUE
    AND warehouse_size IS NOT NULL
    AND start_time >= CURRENT_DATE - 30

Next, we need to determine how long each query ran in each hour. Say we have two queries, one that ran within the hour and one that started in one hour and ended in another.

query_id | execution_start_time    | end_time
123      | 2022-10-08 08:27:51.234 | 2022-10-08 08:30:20.812
456      | 2022-10-08 08:30:11.941 | 2022-10-08 09:01:56.000

We need to generate a table with one row per hour that the query ran within.

query_id | execution_start_time    | end_time                | hour_start              | hour_end
123      | 2022-10-08 08:27:51.234 | 2022-10-08 08:30:20.812 | 2022-10-08 08:00:00.000 | 2022-10-08 09:00:00.000
456      | 2022-10-08 08:30:11.941 | 2022-10-08 09:01:56.000 | 2022-10-08 08:00:00.000 | 2022-10-08 09:00:00.000
456      | 2022-10-08 08:30:11.941 | 2022-10-08 09:01:56.000 | 2022-10-08 09:00:00.000 | 2022-10-08 10:00:00.000

To accomplish this in SQL, we generate a CTE, hours_list, with 1 row per hour in the 30 day range we are looking at. Then, we perform a range join with the filtered_queries to get a CTE, query_hours, with 1 row for each hour that a query executed within.

WITH
filtered_queries AS (
    SELECT
        query_id,
        query_text,
        warehouse_id,
        TIMEADD(
            'millisecond',
            queued_overload_time + compilation_time +
                queued_provisioning_time + queued_repair_time +
                list_external_files_time,
            start_time
        ) AS execution_start_time,
        end_time
    FROM snowflake.account_usage.query_history AS q
    WHERE TRUE
        AND warehouse_size IS NOT NULL
        AND start_time >= DATEADD('day', -30, DATEADD('day', -1, CURRENT_DATE))
),
hours_list AS (
    SELECT
        DATEADD(
            'hour',
            '-' || row_number() over (order by null),
            DATEADD('day', '+1', CURRENT_DATE)
        ) AS hour_start,
        DATEADD('hour', '+1', hour_start) AS hour_end
    FROM TABLE(generator(rowcount => (24*31))) t
),
-- 1 row per hour a query ran
query_hours AS (
    SELECT
        hl.hour_start,
        hl.hour_end,
        queries.*
    FROM hours_list AS hl
    INNER JOIN filtered_queries AS queries
        ON hl.hour_start >= DATE_TRUNC('hour', queries.execution_start_time)
        AND hl.hour_start < queries.end_time
),

Now we can calculate the number of milliseconds each query ran for within each hour along with their fraction relative to all queries.

query_seconds_per_hour AS (
    SELECT
        *,
        DATEDIFF('millisecond', GREATEST(execution_start_time, hour_start), LEAST(end_time, hour_end)) AS num_milliseconds_query_ran,
        SUM(num_milliseconds_query_ran) OVER (PARTITION BY warehouse_id, hour_start) AS total_query_milliseconds_in_hour,
        num_milliseconds_query_ran/total_query_milliseconds_in_hour AS fraction_of_total_query_time_in_hour,
        hour_start AS hour
    FROM query_hours
),

Finally, we get the actual credits used from snowflake.account_usage.warehouse_metering_history and allocate them to each query according to the fraction of all execution time that query contributed. One last aggregation is performed to return the dataset back to one row per query.

credits_billed_per_hour AS (
    SELECT
        start_time AS hour,
        warehouse_id,
        credits_used_compute
    FROM snowflake.account_usage.warehouse_metering_history
),
query_cost AS (
    SELECT
        query.*,
        -- 2.28 is the assumed dollar cost per credit; replace with your contracted rate
        credits.credits_used_compute*2.28 AS actual_warehouse_cost,
        credits.credits_used_compute*fraction_of_total_query_time_in_hour*2.28 AS query_allocated_cost_in_hour
    FROM query_seconds_per_hour AS query
    INNER JOIN credits_billed_per_hour AS credits
        ON query.warehouse_id=credits.warehouse_id
        AND query.hour=credits.hour
)
-- Aggregate back to 1 row per query
SELECT
    query_id,
    ANY_VALUE(MD5(query_text)) AS query_signature,
    ANY_VALUE(query_text) AS query_text,
    SUM(query_allocated_cost_in_hour) AS query_cost,
    ANY_VALUE(warehouse_id) AS warehouse_id,
    SUM(num_milliseconds_query_ran) / 1000 AS execution_time_s
FROM query_cost
GROUP BY 1

Processing the query text

As discussed earlier, many queries will contain custom metadata added as comments, which restrict our ability to group the same queries together. Comments in SQL can come in two forms:

  1. Single line comments starting with --
  2. Single or multi-line comments of the form /* <comment text> */

-- This is a valid SQL comment
SELECT
    id,
    total_price, -- So is this
    created_at /* And this! */
FROM orders
/*
This is also a valid SQL comment.
Woo!
*/

Each of these comment types can be removed using Snowflake's REGEXP_REPLACE function 5.

SELECT
    query_text AS original_query_text,
    -- First, we remove comments enclosed by /* <comment text> */
    REGEXP_REPLACE(query_text, '(/\*.*\*/)') AS _cleaned_query_text,
    -- Next, remove single line comments starting with --
    -- and ending with either a new line or the end of the string
    REGEXP_REPLACE(_cleaned_query_text, '(--.*$)|(--.*\n)') AS cleaned_query_text
FROM snowflake.account_usage.query_history AS q

Now we can aggregate by cleaned_query_text instead of the original query_text when identifying the most expensive queries in a particular timeframe. To see the final version of the SQL implementation using this cleaned_query_text, head to the appendix.

Opportunities for improvement

While this method is a great improvement over the simple approach, there are still opportunities to make it better. The credits associated with warehouse idle time are distributed across all queries that ran in a given hour. Instead, attributing idle spend only to the query or queries that directly caused it will improve the accuracy of the model, and therefore its effectiveness in guiding cost reduction efforts.

This approach also does not take into account the minimum 60-second billing charge. If two queries run separately in a given hour, and one takes 1 second to execute while the other takes 60 seconds, the second query will appear 60 times more expensive than the first, even though the first query also consumes 60 seconds' worth of credits due to the minimum charge.

The query_text processing technique has room for improvement too. It's not uncommon for incremental data models to have hardcoded dates generated into the SQL, which change on each run. For example:

-- Query run on 2022-10-03
CREATE TEMPORARY TABLE orders AS (
    SELECT
        ...
    FROM orders
    WHERE
        created_at BETWEEN DATE'2022-10-01' AND DATE'2022-10-02'
)

You can also see this behaviour in parameterized dashboard queries. For example, a marketing dashboard may expose a templated query:

SELECT
    id,
    email
FROM customers
WHERE
    country_code = {{ selected_country_code }}
    AND signup_date >= CURRENT_DATE - {{ signup_days_back }}

Each time this same query is run, it is populated with different values:

SELECT
    id,
    email
FROM customers
WHERE
    country_code = 'CA'
    AND signup_date >= CURRENT_DATE - 90
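
One rough way to handle this, sketched below and not part of the approach above, is to also strip literals from the comment-free query text (the cleaned_query_text computed earlier) before grouping, so different parameter values collapse into the same bucket. It's deliberately crude: the number pattern will also hit digits inside identifiers, but it is usually good enough for grouping.

SELECT
    query_id,
    REGEXP_REPLACE(
        REGEXP_REPLACE(cleaned_query_text, $$'[^']*'$$, '?'),  -- quoted string literals
        '[0-9]+',                                              -- bare numbers
        '?'
    ) AS normalized_query_text
FROM filtered_queries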

While the parameterized queries can be handled with more advanced SQL text processing, idle and minimum billing times are trickier. At the end of the day, the purpose of attributing warehouse costs to queries is to help users determine where they should focus their time, and we strongly believe the current approach will let you accomplish that goal. All models are wrong, but some are useful.

Planned future enhancements

In addition to the more advanced SQL text processing discussed above, there are a few other enhancements we plan to make to this approach:

  • If cloud services credits exceed 10% of your daily compute credits, Snowflake will begin to charge you for them. To improve the robustness of this model, we need to account for the cloud services credits associated with each query that ran in a warehouse, as well as the queries that did not run in any warehouse. Simple queries like SHOW TABLES that only run in cloud services can end up consuming credits if they are executed very frequently (a quick way to check where your account stands is sketched after this list). See this post on how Metabase metadata queries were costing $500/month in cloud services credits.
  • Extend the model to calculate cost per data asset, rather than cost per query. To calculate cost per dbt model, this will involve parsing the dbt JSON metadata automatically injected into each SQL query generated by dbt. It could also involve connecting to BI tool metadata to calculate things like "cost per dashboard".
  • We plan to bundle this code into a new dbt package so users can easily get greater visibility into their Snowflake spend.
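
As a starting point for the first item, one quick (hedged) check of how your daily cloud services usage compares to the 10% adjustment threshold is to query snowflake.account_usage.metering_daily_history:

SELECT
    usage_date,
    SUM(credits_used_compute) AS compute_credits,
    SUM(credits_used_cloud_services) AS cloud_services_credits,
    -- values near or above 0.10 mean cloud services usage is starting to be billed
    SUM(credits_used_cloud_services) / NULLIF(SUM(credits_used_compute), 0) AS cloud_services_ratio
FROM snowflake.account_usage.metering_daily_history
GROUP BY 1
ORDER BY 1 DESC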

How to identify expensive queries

Once you've calculated the cost per query and stored it in a new table (i.e. query_history_enriched), you can quickly identify the top 100 most expensive queries in your account by running the following query:

with
max_date as (
    select max(date(end_time)) as date
    from query_history_enriched
)
select
    md5(query_parameterized_hash) as query_parameterized_hash,
    sum(query_cost) as total_cost_last_30d,
    total_cost_last_30d*12 as estimated_annual_cost,
    max_by(query_text, start_time) as latest_query_text,
    max_by(warehouse_name, start_time) as latest_warehouse_name,
    max_by(warehouse_size, start_time) as latest_warehouse_size,
    max_by(query_id, start_time) as latest_query_id,
    avg(execution_time/1000) as avg_execution_time_s,
    count(*) as num_executions
from query_history_enriched
cross join max_date
where
    start_time >= dateadd('day', -30, max_date.date)
    and start_time < max_date.date -- don't include partial day of data
group by 1
order by total_cost_last_30d desc
limit 100

Notes

1 Snowflake uses the concept of credits for most of its billable services. When warehouses are running, they consume credits. The rate at which credits are consumed doubles each time the warehouse size is increased. An X-Small warehouse costs 1 credit per hour, a small costs 2 credits per hour, a medium costs 4 credits per hour, etc. Each Snowflake customer will pay a fixed rate per credit, which is how the final dollar value on the monthly bill is calculated.

2 The cost per credit will vary based on the plan you are on (Standard, Enterprise, Business Critical, etc.) and your contract. On-demand customers will generally pay $2/credit on Standard and $3/credit on Enterprise. If you sign an annual contract with Snowflake, this rate will be discounted based on how many credits you purchase up front. All examples here are in US dollars.

3 It is possible for queries to run without a warehouse by leveraging the metadata in cloud services.

4 There are a number of things that need to happen before a query can begin executing in a warehouse, such as query compilation in cloud services and warehouse provisioning. In a future post we'll dive deep into the lifecycle of a Snowflake query.

5 ⚠️ Note that the regex '(/\*.*\*/)' won't correctly handle two comments on the same line, such as /* hi */SELECT * FROM table/* hello there */, since the greedy .* matches everything from the first /* to the last */.

Appendix

Complete SQL Query

For a Snowflake account with ~9 million queries per month, the query below took 93 seconds on an X-Small warehouse.

WITH
filtered_queries AS (
    SELECT
        query_id,
        query_text AS original_query_text,
        -- First, we remove comments enclosed by /* <comment text> */
        REGEXP_REPLACE(query_text, '(/\*.*\*/)') AS _cleaned_query_text,
        -- Next, remove single line comments starting with --
        -- and ending with either a new line or the end of the string
        REGEXP_REPLACE(_cleaned_query_text, '(--.*$)|(--.*\n)') AS cleaned_query_text,
        warehouse_id,
        TIMEADD(
            'millisecond',
            queued_overload_time + compilation_time +
                queued_provisioning_time + queued_repair_time +
                list_external_files_time,
            start_time
        ) AS execution_start_time,
        end_time
    FROM snowflake.account_usage.query_history AS q
    WHERE TRUE
        AND warehouse_size IS NOT NULL
        AND start_time >= DATEADD('day', -30, DATEADD('day', -1, CURRENT_DATE))
),
-- 1 row per hour from 30 days ago until the end of today
hours_list AS (
    SELECT
        DATEADD(
            'hour',
            '-' || row_number() over (order by null),
            DATEADD('day', '+1', CURRENT_DATE)
        ) AS hour_start,
        DATEADD('hour', '+1', hour_start) AS hour_end
    FROM TABLE(generator(rowcount => (24*31))) t
),
-- 1 row per hour a query ran
query_hours AS (
    SELECT
        hl.hour_start,
        hl.hour_end,
        queries.*
    FROM hours_list AS hl
    INNER JOIN filtered_queries AS queries
        ON hl.hour_start >= DATE_TRUNC('hour', queries.execution_start_time)
        AND hl.hour_start < queries.end_time
),
query_seconds_per_hour AS (
    SELECT
        *,
        DATEDIFF('millisecond', GREATEST(execution_start_time, hour_start), LEAST(end_time, hour_end)) AS num_milliseconds_query_ran,
        SUM(num_milliseconds_query_ran) OVER (PARTITION BY warehouse_id, hour_start) AS total_query_milliseconds_in_hour,
        num_milliseconds_query_ran/total_query_milliseconds_in_hour AS fraction_of_total_query_time_in_hour,
        hour_start AS hour
    FROM query_hours
),
credits_billed_per_hour AS (
    SELECT
        start_time AS hour,
        warehouse_id,
        credits_used_compute
    FROM snowflake.account_usage.warehouse_metering_history
),
query_cost AS (
    SELECT
        query.*,
        -- 2.28 is the assumed dollar cost per credit; replace with your contracted rate
        credits.credits_used_compute*2.28 AS actual_warehouse_cost,
        credits.credits_used_compute*fraction_of_total_query_time_in_hour*2.28 AS query_allocated_cost_in_hour
    FROM query_seconds_per_hour AS query
    INNER JOIN credits_billed_per_hour AS credits
        ON query.warehouse_id=credits.warehouse_id
        AND query.hour=credits.hour
),
cost_per_query AS (
    SELECT
        query_id,
        ANY_VALUE(MD5(cleaned_query_text)) AS query_signature,
        SUM(query_allocated_cost_in_hour) AS query_cost,
        ANY_VALUE(original_query_text) AS original_query_text,
        ANY_VALUE(warehouse_id) AS warehouse_id,
        SUM(num_milliseconds_query_ran) / 1000 AS execution_time_s
    FROM query_cost
    GROUP BY 1
)
SELECT
    query_signature,
    COUNT(*) AS num_executions,
    AVG(query_cost) AS avg_cost_per_execution,
    SUM(query_cost) AS total_cost_last_30d,
    ANY_VALUE(original_query_text) AS sample_query_text
FROM cost_per_query
GROUP BY 1

Alternative approach considered

Before landing on the final approach presented above, we considered an approach that more accurately handled concurrency and idle time, particularly across multi-cluster warehouses. Instead of working from the actual credits charged per hour, this approach leveraged the snowflake.account_usage.warehouse_events_history view to construct a dataset with 1 row per second each warehouse cluster was active. Using this dataset, along with the knowledge of which query ran on which warehouse cluster, it's possible to more accurately attribute credits to each set of queries, as shown in the diagram below.

(Diagram: alternative, more accurate approach for calculating cost per query in Snowflake)

Unfortunately, we discovered that warehouse_events_history does not give a perfect representation of when each warehouse cluster was active, so this approach was abandoned.

