
Calculating cost per query in Snowflake

  • Ian Whitestone
    Co-founder & CEO of SELECT

For most Snowflake customers, compute costs (the charges for virtual warehouses) will make up the largest portion of the bill. To effectively reduce and manage this spend, high-cost queries need to be accurately identified.

Snowflake customers are billed for each second that virtual warehouses are running [1], with a minimum 60-second charge each time one is resumed. The Snowflake UI currently provides a breakdown of cost per virtual warehouse, but doesn't attribute spend at a more granular, per-query level. This post provides a detailed overview and comparison of different ways to attribute warehouse costs to queries, along with the code required to do so.

Skip to the final SQL?

If you want to skip ahead and see the SQL implementation for the recommended approach, you can head straight to the end!

Simple approach

We'll start with a simple approach that multiplies a query's execution time by the billing rate of the warehouse it ran on. For example, say a query ran for 10 minutes on a medium warehouse. A medium warehouse costs 4 credits per hour, and with a cost of $3 per credit [2], we'd say this query costs $2 (10/60 hours * 4 credits/hour * $3/credit).

SQL Implementation

We can implement this in SQL by leveraging the snowflake.account_usage.query_history view which contains all queries from the last year along with key metadata like the total execution time and size of the warehouse the query ran on:

WITH
warehouse_sizes AS (
    SELECT 'X-Small'  AS warehouse_size, 1   AS credits_per_hour UNION ALL
    SELECT 'Small'    AS warehouse_size, 2   AS credits_per_hour UNION ALL
    SELECT 'Medium'   AS warehouse_size, 4   AS credits_per_hour UNION ALL
    SELECT 'Large'    AS warehouse_size, 8   AS credits_per_hour UNION ALL
    SELECT 'X-Large'  AS warehouse_size, 16  AS credits_per_hour UNION ALL
    SELECT '2X-Large' AS warehouse_size, 32  AS credits_per_hour UNION ALL
    SELECT '3X-Large' AS warehouse_size, 64  AS credits_per_hour UNION ALL
    SELECT '4X-Large' AS warehouse_size, 128 AS credits_per_hour
)
SELECT
    qh.query_id,
    qh.query_text,
    qh.execution_time/(1000*60*60)*wh.credits_per_hour AS query_cost
FROM snowflake.account_usage.query_history AS qh
INNER JOIN warehouse_sizes AS wh
    ON qh.warehouse_size = wh.warehouse_size
WHERE
    start_time >= CURRENT_DATE - 30

This gives us an estimated cost, in credits, for each query_id; multiply by your price per credit to convert to dollars. To account for the same query being run multiple times in a period, we can aggregate by the query_text:

WITH
warehouse_sizes AS (
    // same as above
),
queries AS (
    SELECT
        qh.query_id,
        qh.query_text,
        qh.execution_time/(1000*60*60)*wh.credits_per_hour AS query_cost
    FROM snowflake.account_usage.query_history AS qh
    INNER JOIN warehouse_sizes AS wh
        ON qh.warehouse_size = wh.warehouse_size
    WHERE
        start_time >= CURRENT_DATE - 30
)
SELECT
    query_text,
    SUM(query_cost) AS total_query_cost_last_30d
FROM queries
GROUP BY 1

Opportunities for improvement

While simple and easy to understand, the main pitfall with this approach is that Snowflake does not charge per second a query ran; it charges per second the warehouse is up. A given query may automatically resume the warehouse, run for 6 seconds, and then leave the warehouse sitting idle until it is automatically suspended. Snowflake bills for this idle time, so it can be helpful to "charge back" this cost to the query. Similarly, if two queries run concurrently on the warehouse for the same 20 minutes, Snowflake will bill for 20 minutes, not 40. Idle time and concurrency are therefore important considerations in cost attribution and optimization efforts.
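To see how much this matters for your account, you can compare the credits each warehouse actually consumed against what the simple execution-time model attributes; the gap is the idle time and overhead the simple approach can't explain. Below is a rough sketch of that comparison (not part of the original approach). It reuses the warehouse_sizes CTE from above and the snowflake.account_usage.warehouse_metering_history view introduced later in this post. Note the difference can even go negative on warehouses with heavy concurrency, since the simple model double counts overlapping queries.

WITH
warehouse_sizes AS (
    // same as above
),
-- credits actually metered per warehouse over the last 30 days
metered AS (
    SELECT
        warehouse_id,
        SUM(credits_used_compute) AS credits_metered
    FROM snowflake.account_usage.warehouse_metering_history
    WHERE start_time >= CURRENT_DATE - 30
    GROUP BY 1
),
-- credits the simple execution-time model would attribute
estimated AS (
    SELECT
        qh.warehouse_id,
        SUM(qh.execution_time/(1000*60*60)*wh.credits_per_hour) AS credits_estimated
    FROM snowflake.account_usage.query_history AS qh
    INNER JOIN warehouse_sizes AS wh
        ON qh.warehouse_size = wh.warehouse_size
    WHERE qh.start_time >= CURRENT_DATE - 30
    GROUP BY 1
)
SELECT
    m.warehouse_id,
    m.credits_metered,
    e.credits_estimated,
    m.credits_metered - e.credits_estimated AS credits_not_explained_by_execution_time
FROM metered AS m
LEFT JOIN estimated AS e
    ON m.warehouse_id = e.warehouse_id
ORDER BY credits_not_explained_by_execution_time DESC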

When aggregating by query_text to get total cost in the period, we grouped by the unprocessed query text. In practice, it is common for the systems that generate these queries to add unique metadata to each query. For example, Looker will add some context to each query. The first time a query is run, it may look like this:

SELECT
    id,
    created_at
FROM orders
-- Looker Query Context '{"user_id":181,"history_slug":"9dcf35a","instance_slug":"aab1f6"}'

And in the next run, this metadata will be different:

SELECT
    id,
    created_at
FROM orders
-- Looker Query Context '{"user_id":181,"history_slug":"1kal99e","instance_slug":"jju3q8"}'

Similar to Looker, dbt will add its own metadata, giving each query a unique invocation_id:

SELECT
    id,
    created_at
FROM orders
/*{
    "app": "dbt",
    "invocation_id": "52c47806ae6d",
    "node_id": "model.jaffle_shop.orders",
    ...
}*/

When grouped by query_text, the two occurrences of the query above won't be linked, since this metadata makes each one unique. This could result in a single, potentially easily addressed source of high-cost queries (for example, a dashboard) going unidentified.

We may wish to go even further and bucket costs at a higher level. dbt models often consist of multiple queries being run: a CREATE TEMPORARY TABLE followed by a MERGE statement, for example. A given dashboard may trigger 5 different queries each time it is refreshed. Being able to group the entire collection of queries from a single origin is very useful for attributing spend and then targeting improvements in a time-efficient manner.
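As a hint of what that grouping could look like, here's a minimal sketch (not from the original post) that pulls the dbt metadata comment out of the query text with SPLIT_PART and parses it with TRY_PARSE_JSON, so costs could later be rolled up by dbt model via node_id. It assumes the comment format shown above and that no other /* ... */ comment appears earlier in the query text; production-grade parsing would need to be more careful.

WITH dbt_queries AS (
    SELECT
        query_id,
        -- grab the text between the first /* and the following */, then parse it as JSON
        TRY_PARSE_JSON(SPLIT_PART(SPLIT_PART(query_text, '/*', 2), '*/', 1)) AS query_comment_json
    FROM snowflake.account_usage.query_history
    WHERE query_text ILIKE '%"app": "dbt"%'
)
SELECT
    query_id,
    query_comment_json:node_id::STRING AS dbt_node_id
FROM dbt_queries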

With these opportunities in mind, can we do better?

New approach

To be able to reconcile the total attributed query costs with the final bill, it's important to start with the exact charges for each warehouse. We work at an hourly granularity because snowflake.account_usage.warehouse_metering_history, the source of truth for warehouse charges, reports credit consumption at an hourly level. We can then calculate how many seconds each query spent executing in each hour, and allocate that hour's credits proportionally to each query based on its share of the total execution time. In doing so, we account for idle time by distributing it among the queries that ran during the period. Concurrency is also handled, since more queries running will generally lower the average cost per query.

To ground this in an example, say the TRANSFORMING_WAREHOUSE consumed 100 credits in a single hour. During that time, three queries ran, 2 for 10 minutes and 1 for 20 minutes, for 40 minutes of total execution time. In this scenario, we would allocate credits to each query in the following way:

  1. Query 1 (10 minutes) -> 25 credits
  2. Query 2 (20 minutes) -> 50 credits
  3. Query 3 (10 minutes) -> 25 credits
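In other words, for each warehouse and each hour, the allocation rule is simply (using the column names that appear in the SQL below):

query_allocated_credits = credits_used_compute * num_milliseconds_query_ran / total_query_milliseconds_in_hour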

In the diagram below, query 3 begins between 17:00-18:00 and finishes after 18:00. To account for queries which span multiple hours, we only include the portion of the query that ran in each hour.

[Diagram: Calculating cost per query in Snowflake]

When only one query runs in an hour, like Query 5 below, all credit consumption is attributed to that one query, including the credits consumed by the warehouse sitting idle.

[Diagram: Calculating cost per query in Snowflake with idle time]

SQL Implementation

Some queries don't execute on a warehouse and are processed entirely by the cloud services layer. To filter these out, we remove queries where warehouse_size IS NULL [3]. We'll also calculate a new timestamp, execution_start_time, to denote the exact time at which the query began running on the warehouse [4].

SELECT
    query_id,
    query_text,
    warehouse_id,
    TIMEADD(
        'millisecond',
        queued_overload_time + compilation_time +
            queued_provisioning_time + queued_repair_time +
            list_external_files_time,
        start_time
    ) AS execution_start_time,
    end_time
FROM snowflake.account_usage.query_history AS q
WHERE TRUE
    AND warehouse_size IS NOT NULL
    AND start_time >= CURRENT_DATE - 30

Next, we need to determine how long each query ran in each hour. Say we have two queries, one that ran within the hour and one that started in one hour and ended in another.

query_id | execution_start_time    | end_time
123      | 2022-10-08 08:27:51.234 | 2022-10-08 08:30:20.812
456      | 2022-10-08 08:30:11.941 | 2022-10-08 09:01:56.000

We need to generate a table with one row per hour that the query ran within.

query_id | execution_start_time    | end_time                | hour_start              | hour_end
123      | 2022-10-08 08:27:51.234 | 2022-10-08 08:30:20.812 | 2022-10-08 08:00:00.000 | 2022-10-08 09:00:00.000
456      | 2022-10-08 08:30:11.941 | 2022-10-08 09:01:56.000 | 2022-10-08 08:00:00.000 | 2022-10-08 09:00:00.000
456      | 2022-10-08 08:30:11.941 | 2022-10-08 09:01:56.000 | 2022-10-08 09:00:00.000 | 2022-10-08 10:00:00.000

To accomplish this in SQL, we generate a CTE, hours_list, with 1 row per hour in the 30-day window we are looking at. Then we perform a range join with filtered_queries to get a CTE, query_hours, with 1 row for each hour a query executed within.

WITH
filtered_queries AS (
    SELECT
        query_id,
        query_text,
        warehouse_id,
        TIMEADD(
            'millisecond',
            queued_overload_time + compilation_time +
                queued_provisioning_time + queued_repair_time +
                list_external_files_time,
            start_time
        ) AS execution_start_time,
        end_time
    FROM snowflake.account_usage.query_history AS q
    WHERE TRUE
        AND warehouse_size IS NOT NULL
        AND start_time >= DATEADD('day', -30, DATEADD('day', -1, CURRENT_DATE))
),
hours_list AS (
    SELECT
        DATEADD(
            'hour',
            '-' || ROW_NUMBER() OVER (ORDER BY NULL),
            DATEADD('day', '+1', CURRENT_DATE)
        ) AS hour_start,
        DATEADD('hour', '+1', hour_start) AS hour_end
    FROM TABLE(GENERATOR(rowcount => (24*31))) AS t
),
-- 1 row per hour a query ran
query_hours AS (
    SELECT
        hl.hour_start,
        hl.hour_end,
        queries.*
    FROM hours_list AS hl
    INNER JOIN filtered_queries AS queries
        ON hl.hour_start >= DATE_TRUNC('hour', queries.execution_start_time)
        AND hl.hour_start < queries.end_time
),

Now we can calculate the number of milliseconds each query ran for within each hour, along with its fraction of the total across all queries in that hour, by leveraging the DATEDIFF function.

query_seconds_per_hour AS (
    SELECT
        *,
        DATEDIFF('millisecond', GREATEST(execution_start_time, hour_start), LEAST(end_time, hour_end)) AS num_milliseconds_query_ran,
        SUM(num_milliseconds_query_ran) OVER (PARTITION BY warehouse_id, hour_start) AS total_query_milliseconds_in_hour,
        num_milliseconds_query_ran/total_query_milliseconds_in_hour AS fraction_of_total_query_time_in_hour,
        hour_start AS hour
    FROM query_hours
),

Finally, we get the actual credits used from snowflake.account_usage.warehouse_metering_history and allocate them to each query according to the fraction of the total execution time that query contributed. One last aggregation is performed to return the dataset back to one row per query. Note that the queries below convert credits to dollars using a rate of $2.28 per credit; substitute your own effective price per credit.

credits_billed_per_hour AS (
    SELECT
        start_time AS hour,
        warehouse_id,
        credits_used_compute
    FROM snowflake.account_usage.warehouse_metering_history
),
query_cost AS (
    SELECT
        query.*,
        credits.credits_used_compute*2.28 AS actual_warehouse_cost,
        credits.credits_used_compute*fraction_of_total_query_time_in_hour*2.28 AS query_allocated_cost_in_hour
    FROM query_seconds_per_hour AS query
    INNER JOIN credits_billed_per_hour AS credits
        ON query.warehouse_id = credits.warehouse_id
        AND query.hour = credits.hour
)
-- Aggregate back to 1 row per query
SELECT
    query_id,
    ANY_VALUE(MD5(query_text)) AS query_signature,
    ANY_VALUE(query_text) AS query_text,
    SUM(query_allocated_cost_in_hour) AS query_cost,
    ANY_VALUE(warehouse_id) AS warehouse_id,
    SUM(num_milliseconds_query_ran) / 1000 AS execution_time_s
FROM query_cost
GROUP BY 1

Processing the query text

As discussed earlier, many queries will contain custom metadata added as comments, which restrict our ability to group the same queries together. Comments in SQL can come in two forms:

  1. Single line comments starting with --
  2. Single or multi-line comments of the form /* <comment text> */

-- This is a valid SQL comment
SELECT
    id,
    total_price, -- So is this
    created_at /* And this! */
FROM orders
/*
This is also a valid SQL comment.
Woo!
*/

Each of these comment types can be removed using Snowflake's REGEXP_REPLACE function [5].

SELECT
    query_text AS original_query_text,
    -- First, we remove comments enclosed by /* <comment text> */
    REGEXP_REPLACE(query_text, '(/\*.*\*/)') AS _cleaned_query_text,
    -- Next, removes single line comments starting with --
    -- and either ending with a new line or end of string
    REGEXP_REPLACE(_cleaned_query_text, '(--.*$)|(--.*\n)') AS cleaned_query_text
FROM snowflake.account_usage.query_history AS q

Now we can aggregate by cleaned_query_text instead of the original query_text when identifying the most expensive queries in a particular timeframe. To see the final version of the SQL implementation using this cleaned_query_text, head to the appendix.

Opportunities for improvement

While this method is a great improvement over the simple approach, there are still opportunities to make it better. The credits associated with warehouse idle time are distributed across all queries that ran in a given hour. Instead, attributing idle spend only to the query or queries that directly caused it will improve the accuracy of the model, and therefore its effectiveness in guiding cost reduction efforts.

This approach also does not take into account the minimum 60-second billing charge. If two queries run separately in a given hour, and one takes 1 second to execute while the other takes 60 seconds, the second query will appear 60 times more expensive than the first, even though the first query still consumes 60 seconds' worth of credits due to the minimum charge.
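One rough way to gauge how often the minimum charge comes into play (a heuristic sketch, not part of the cost model above) is to count queries with a non-zero queued_provisioning_time, since that generally indicates the query had to wait for the warehouse to be created, resumed, or resized:

SELECT
    warehouse_name,
    COUNT(*) AS num_queries_that_waited_for_provisioning
FROM snowflake.account_usage.query_history
WHERE queued_provisioning_time > 0
    AND start_time >= CURRENT_DATE - 30
GROUP BY 1
ORDER BY 2 DESC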

The query_text processing technique has room for improvement too. It's not uncommon for incremental data models to have hardcoded dates generated into the SQL, which change on each run. For example:

-- Query run on 2022-10-03
CREATE TEMPORARY TABLE orders AS (
    SELECT
        ...
    FROM orders
    WHERE
        created_at BETWEEN DATE'2022-10-01' AND DATE'2022-10-02'
)

You can also see this behaviour in parameterized dashboard queries. For example, a marketing dashboard may expose a templated query:

SELECT
    id,
    email
FROM customers
WHERE
    country_code = {{ selected_country_code }}
    AND signup_date >= CURRENT_DATE - {{ signup_days_back }}

Each time this same query is run, it is populated with different values:

SELECT
    id,
    email
FROM customers
WHERE
    country_code = 'CA'
    AND signup_date >= CURRENT_DATE - 90

While the parameterized queries can be handled with more advanced SQL text processing, idle and minimum billing times are trickier. At the end of the day, the purpose of attributing warehouse costs to queries is to help users determine where they should focus their time, and we strongly believe the current approach will let you accomplish that goal. All models are wrong, but some are useful.
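As one example of that more advanced text processing, a further normalization pass could replace date literals with a placeholder before grouping, so reruns of the same incremental model collapse into a single group. A minimal sketch (not from the original post), assuming cleaned_query_text has already been produced as shown earlier, and using a character class to sidestep backslash-escaping concerns:

SELECT
    cleaned_query_text,
    -- replace YYYY-MM-DD date literals with a placeholder
    REGEXP_REPLACE(cleaned_query_text, '[0-9]{4}-[0-9]{2}-[0-9]{2}', '<date>') AS normalized_query_text
FROM filtered_queries -- the CTE from the appendix query, which contains cleaned_query_text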

Planned future enhancements

In addition to the more advanced SQL text processing discussed above, there are a few other enhancements we plan to make to this approach:

  • If cloud services credits exceed 10% of your daily compute credits, Snowflake will begin to charge you for them. To improve the robustness of this model, we need to account for the cloud services credits associated with each query that ran in a warehouse, as well as the queries that did not run in any warehouse. Simple queries like SHOW TABLES that only run in cloud services can end up consuming credits if they are executed very frequently. See this post on how Metabase metadata queries were costing $500/month in cloud services credits. A rough starting point for surfacing these queries is sketched after this list.
  • Extend the model to calculate cost per data asset, rather than cost per query. To calculate cost per dbt model, this will involve parsing the dbt JSON metadata automatically injected into each SQL query generated by dbt. It could also involve connecting to BI tool metadata to calculate things like "cost per dashboard".
  • We plan to bundle this code into a new dbt package so users can easily get greater visibility into their Snowflake spend.
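Relating to the first item above, the credits_used_cloud_services column in query_history already gives a per-query view of cloud services consumption. A simple sketch (separate from the warehouse cost model above) for finding the query patterns driving that usage:

SELECT
    MD5(query_text) AS query_signature,
    ANY_VALUE(query_text) AS sample_query_text,
    COUNT(*) AS num_executions,
    SUM(credits_used_cloud_services) AS cloud_services_credits_last_30d
FROM snowflake.account_usage.query_history
WHERE start_time >= CURRENT_DATE - 30
GROUP BY 1
ORDER BY cloud_services_credits_last_30d DESC
LIMIT 100

Keep in mind that only the portion of daily cloud services usage above the 10% threshold is actually billed, so these credits are an upper bound on what you pay for them.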

How to identify expensive queries

Once you've calculated the cost per query and stored it in a new table (e.g. query_history_enriched), you can quickly identify the top 100 most expensive queries in your account by running the following query:

with
max_date as (
    select max(date(end_time)) as date
    from query_history_enriched
)
select
    md5(query_parameterized_hash) as query_parameterized_hash,
    sum(query_cost) as total_cost_last_30d,
    total_cost_last_30d*12 as estimated_annual_cost,
    max_by(query_text, start_time) as latest_query_text,
    max_by(warehouse_name, start_time) as latest_warehouse_name,
    max_by(warehouse_size, start_time) as latest_warehouse_size,
    max_by(query_id, start_time) as latest_query_id,
    avg(execution_time/1000) as avg_execution_time_s,
    count(*) as num_executions
from query_history_enriched
cross join max_date
where
    start_time >= dateadd('day', -30, max_date.date)
    and start_time < max_date.date -- don't include partial day of data
group by 1
order by total_cost_last_30d desc
limit 100

Notes

[1] Snowflake uses the concept of credits for most of its billable services. When warehouses are running, they consume credits. The rate at which credits are consumed doubles each time the warehouse size is increased. An X-Small warehouse costs 1 credit per hour, a Small costs 2 credits per hour, a Medium costs 4 credits per hour, etc. Each Snowflake customer will pay a fixed rate per credit, which is how the final dollar value on the monthly bill is calculated.

[2] The cost per credit will vary based on the plan you are on (Standard, Enterprise, Business Critical, etc.) and your contract. On-demand customers will generally pay $2/credit for Standard and $3/credit on Enterprise. If you sign an annual contract with Snowflake, this rate will get discounted based on how many credits you purchase up front. All examples here are in US dollars.

[3] It is possible for queries to run without a warehouse by leveraging metadata in the cloud services layer.

[4] There are a number of things that need to happen before a query can begin executing in a warehouse, such as query compilation in cloud services and warehouse provisioning. In a future post we'll dive deep into the lifecycle of a Snowflake query.

[5] ⚠️ The regex '(/\*.*\*/)' won't work correctly when two comments appear on the same line, such as /* hi */SELECT * FROM table/* hello there */, since the greedy .* matches everything between the first /* and the last */.

Appendix

Complete SQL Query

For a Snowflake account with ~9 million queries per month, the query below took 93 seconds on an X-Small warehouse.

WITH
filtered_queries AS (
    SELECT
        query_id,
        query_text AS original_query_text,
        -- First, we remove comments enclosed by /* <comment text> */
        REGEXP_REPLACE(query_text, '(/\*.*\*/)') AS _cleaned_query_text,
        -- Next, removes single line comments starting with --
        -- and either ending with a new line or end of string
        REGEXP_REPLACE(_cleaned_query_text, '(--.*$)|(--.*\n)') AS cleaned_query_text,
        warehouse_id,
        TIMEADD(
            'millisecond',
            queued_overload_time + compilation_time +
                queued_provisioning_time + queued_repair_time +
                list_external_files_time,
            start_time
        ) AS execution_start_time,
        end_time
    FROM snowflake.account_usage.query_history AS q
    WHERE TRUE
        AND warehouse_size IS NOT NULL
        AND start_time >= DATEADD('day', -30, DATEADD('day', -1, CURRENT_DATE))
),
-- 1 row per hour from 30 days ago until the end of today
hours_list AS (
    SELECT
        DATEADD(
            'hour',
            '-' || ROW_NUMBER() OVER (ORDER BY NULL),
            DATEADD('day', '+1', CURRENT_DATE)
        ) AS hour_start,
        DATEADD('hour', '+1', hour_start) AS hour_end
    FROM TABLE(GENERATOR(rowcount => (24*31))) AS t
),
-- 1 row per hour a query ran
query_hours AS (
    SELECT
        hl.hour_start,
        hl.hour_end,
        queries.*
    FROM hours_list AS hl
    INNER JOIN filtered_queries AS queries
        ON hl.hour_start >= DATE_TRUNC('hour', queries.execution_start_time)
        AND hl.hour_start < queries.end_time
),
query_seconds_per_hour AS (
    SELECT
        *,
        DATEDIFF('millisecond', GREATEST(execution_start_time, hour_start), LEAST(end_time, hour_end)) AS num_milliseconds_query_ran,
        SUM(num_milliseconds_query_ran) OVER (PARTITION BY warehouse_id, hour_start) AS total_query_milliseconds_in_hour,
        num_milliseconds_query_ran/total_query_milliseconds_in_hour AS fraction_of_total_query_time_in_hour,
        hour_start AS hour
    FROM query_hours
),
credits_billed_per_hour AS (
    SELECT
        start_time AS hour,
        warehouse_id,
        credits_used_compute
    FROM snowflake.account_usage.warehouse_metering_history
),
query_cost AS (
    SELECT
        query.*,
        credits.credits_used_compute*2.28 AS actual_warehouse_cost,
        credits.credits_used_compute*fraction_of_total_query_time_in_hour*2.28 AS query_allocated_cost_in_hour
    FROM query_seconds_per_hour AS query
    INNER JOIN credits_billed_per_hour AS credits
        ON query.warehouse_id = credits.warehouse_id
        AND query.hour = credits.hour
),
cost_per_query AS (
    SELECT
        query_id,
        ANY_VALUE(MD5(cleaned_query_text)) AS query_signature,
        SUM(query_allocated_cost_in_hour) AS query_cost,
        ANY_VALUE(original_query_text) AS original_query_text,
        ANY_VALUE(warehouse_id) AS warehouse_id,
        SUM(num_milliseconds_query_ran) / 1000 AS execution_time_s
    FROM query_cost
    GROUP BY 1
)
SELECT
    query_signature,
    COUNT(*) AS num_executions,
    AVG(query_cost) AS avg_cost_per_execution,
    SUM(query_cost) AS total_cost_last_30d,
    ANY_VALUE(original_query_text) AS sample_query_text
FROM cost_per_query
GROUP BY 1

Alternative approach considered

Before landing on the final approach presented above, we considered an alternative that more accurately handles concurrency and idle time, especially across multi-cluster warehouses. Instead of working from the actual credits charged per hour, this approach leveraged the snowflake.account_usage.warehouse_events_history view to construct a dataset with 1 row per second each warehouse cluster was active. Using this dataset, along with knowledge of which query ran on which warehouse cluster, it's possible to more accurately attribute credits to each set of queries, as shown in the diagram below.

[Diagram: Alternative, more accurate approach for calculating cost per query in Snowflake]

Unfortunately, we discovered that warehouse_events_history does not give a perfect representation of when each warehouse cluster was active, so this approach was abandoned.

