Using the Snowflake Query History: 10 Practical Examples
Ian Whitestone, Co-founder & CEO of SELECT
One of my favorite aspects of Snowflake is all the metadata they make programmatically accessible to all users, at no extra cost! The Snowflake database is packed with hundreds of different views users can access via SQL to answer practically any question about their Snowflake usage.
The Query History dataset is one of the most widely used from the Snowflake database, as it contains a bunch of useful metadata about every query run in your account. The Query History can power a variety of different use cases like analyzing historical query runtimes, finding expensive queries or analyzing query failure rates.
In this post, I’ll share 10 real-world examples you can run in your account today.
Accessing the Query History
Before jumping into the examples of how to use the query history, let’s quickly cover the 3 different ways you can access it.
Through the Snowsight UI
The easiest way to get started with using the Query History dataset is through the Snowsight UI, under the Monitoring tab.
From the UI, you can apply several different filters, and view additional columns.
It’s important to note that the UI only contains queries from the last 14 days.
The UI is most helpful for quickly checking queries that recently ran or are currently running. Once you need more flexible filtering or additional data, you’ll typically move to one of the next two methods.
Information Schema Table Functions (information_schema.query_history)
The same data shown in the UI is available via the information_schema.query_history() table function, but you can only access data from the last 7 days. On the plus side, you can retrieve data for queries that are currently running!
Here’s an example to retrieve the last 100 queries run in your account:
select *
from table(snowflake.information_schema.query_history())
order by start_time
;
If you want to return more, you can return up to 10000 with the RESULT_LIMIT argument (the default is 100):
select *
from table(snowflake.information_schema.query_history(result_limit=>10000))
order by start_time
;
You can specify a custom time range using the end_time_range_start and end_time_range_end parameters:
select
*
from table(snowflake.information_schema.query_history(
end_time_range_start=>dateadd('hours',-1,current_timestamp()),
end_time_range_end=>current_timestamp(),
result_limit=>10000
))
order by start_time
;
You can also restrict which columns are returned, or add filters. It’s important to note that any filters you add will be applied after the fact, meaning Snowflake will first return 10000 records, then filter them.
select
query_id,
query_text,
user_name,
execution_status,
start_time,
end_time
from table(snowflake.information_schema.query_history(result_limit=>10000))
where
user_name='IAN'
order by start_time
If possible, you should rely on the other available table functions when performing filtering:
QUERY_HISTORY_BY_SESSION
QUERY_HISTORY_BY_USER
QUERY_HISTORY_BY_WAREHOUSE
The example above could be re-written as:
select
query_id,
query_text,
user_name,
execution_status,
start_time,
end_time
from table(snowflake.information_schema.query_history_by_user(user_name=>'IAN', result_limit=>10000))
order by start_time
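Similarly, here’s a sketch of pulling recent queries for a single warehouse using the query_history_by_warehouse function (the warehouse name TRANSFORMING is just a placeholder):
select
query_id,
query_text,
user_name,
execution_status,
start_time,
end_time
from table(snowflake.information_schema.query_history_by_warehouse(warehouse_name=>'TRANSFORMING', result_limit=>10000))
order by start_time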
Latency & Data Retention
The main advantage of using the information_schema.query_history() table function is that it has no latency. If a query completed one second ago, it will show up in the output of this function. The same goes for a query that is still running.
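For example, here’s a minimal sketch for listing queries that are still running right now, using the execution_status column (a case-insensitive match is used since the status value casing can vary):
select
query_id,
query_text,
user_name,
warehouse_name,
start_time
from table(snowflake.information_schema.query_history(result_limit=>10000))
where execution_status ilike 'running'
order by start_time
;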
The main downsides of using this dataset are that:
- it only contains queries from the last 7 days
- it can only return up to 10,000 queries per call
- table functions are less intuitive to work with, compared to querying a view/table
If you need to analyze older queries or analyze trends over a longer time period, you’ll have to rely on the account_usage view, discussed below.
Account Usage View (account_usage.query_history)
The snowflake.account_usage.query_history view is my preferred way of analyzing & accessing the Query History dataset. You don’t need to remember how to work with table functions, and you can flexibly analyze data from the last year.
Here’s an example query where I count the number of queries from a user in the last 30 days:
select count(*)
from snowflake.account_usage.query_history
where
start_time > current_date - 30
and user_name='IAN'
For a full list of columns available, consult the Snowflake documentation.
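You can also inspect the columns directly from a worksheet; for example (assuming your role has access to the shared SNOWFLAKE database):
describe view snowflake.account_usage.query_history;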
Latency & Data Retention
The snowflake.account_usage.query_history view can have a latency up to 45 minutes. With that said, you’ll often see data much sooner in practice.
In terms of data retention, this view contains data from the last 365 days.
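One rough way to check the current latency yourself is to compare the most recent end_time in the view against the current timestamp (a minimal sketch):
select timediff('minute', max(end_time), current_timestamp()) as approx_latency_minutes
from snowflake.account_usage.query_history;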
Are all queries included in the Query History?
Generally speaking, all queries you run in Snowflake will show up in the Query History. This includes programmatically executed queries, queries run through the UI, queries run by tasks or stored procedures, and queries executed by Streamlit dashboards. The only queries excluded from the Query History are short-running queries that operate exclusively against hybrid tables. If you are trying to analyze these queries, you’ll need to rely on the aggregate query history view.
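As a starting point, here’s a minimal sketch of peeking at that view (assuming your role has access to it):
select *
from snowflake.account_usage.aggregate_query_history
limit 100;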
Query History Data Retention
The Query History data retention period depends on which method you use to access the dataset:
- Snowsight UI: 14 days
- Information Schema Query History Table Function: 7 days
- Account Usage Query History View: 365 days
Now that we’ve covered the basics for Query History and how to access it, let’s dive into some real examples! Everything covered below will rely on the account usage view discussed above.
1. Identify longest-running queries
To identify the longest-running queries in the last 30 days, you can sort by the total_elapsed_time column:
select
query_id,
total_elapsed_time/1000 as total_elapsed_time_s, -- convert to seconds
user_name,
query_text
from snowflake.account_usage.query_history
where
start_time > current_date - 30
order by total_elapsed_time desc
2. Analyze query performance over time
Here’s an example to analyze the average and 90th percentile query performance each hour in a given warehouse.
select
date_trunc('hour', start_time) as start_time,
avg(total_elapsed_time)/1000 as avg_total_elapsed_time_s,
approx_percentile(total_elapsed_time, 0.90)/1000 as p90_total_elapsed_time_s,
from snowflake.account_usage.query_history
where
warehouse_name='SELECT_BACKEND'
and start_time between '2024-03-15' and '2024-03-31'
group by 1
;
You can even get a nice visualization directly in Snowsight!
3. Identify recurring query patterns with query_parameterized_hash
Snowflake recently (in 2023) introduced a new column called query_parameterized_hash. This value is created by removing literals from the query and then hashing the query text, which helps you identify recurring queries that differ only in their parameter values. You can learn more about the query_parameterized_hash here, along with its limitations (it only works with certain comparison operators).
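For example, these two queries differ only in a literal, so they should produce the same query_parameterized_hash (my_orders is a hypothetical table used purely for illustration):
select * from my_orders where customer_id = 101;
select * from my_orders where customer_id = 202;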
Here’s an example to identify the top 100 query patterns based on the total elapsed time. The example also pulls the latest query text, user name and warehouse associated with the query parameterized hash.
select
query_parameterized_hash,
count(*) as num_executions,
avg(total_elapsed_time)/1000 as avg_total_elapsed_time_s,
sum(total_elapsed_time)/1000 as total_elapsed_time_s,
max_by(query_text, start_time) as latest_query_text,
max_by(user_name, start_time) as latest_user_name,
max_by(warehouse_name, start_time) as latest_warehouse_name
from snowflake.account_usage.query_history
where
start_time > current_date - 7
group by 1
order by total_elapsed_time_s desc
limit 100
;
4. Top queries by query type
Snowflake includes a query_type column to help you differentiate between the different types of queries run in your account. Examples include SELECT, CREATE_TABLE_AS_SELECT, INSERT, DELETE and [MERGE](https://select.dev/posts/snowflake-merges).
Here’s a query to identify the most common query types in your account over the last 7 days:
select
query_type,
count(*) as cnt
from snowflake.account_usage.query_history
where
start_time > current_date - 7
group by 1
order by cnt desc
5. Queries run by a user in the last 3 months
To retrieve all the queries run by a user in the last 3 months, you can run the following:
select *
from snowflake.account_usage.query_history
where
user_name='IAN'
and start_time > current_date - 90
6. Metadata only queries
Certain queries don’t require a running virtual warehouse. They can instead be served from the Cloud Services layer using metadata only. For example, select count(*) from my_table can be answered entirely from metadata and does not need to run on a warehouse. Similarly, a query can avoid running on a warehouse if the exact same query was run in the last 24 hours and the underlying data hasn’t changed; in that case, Snowflake returns the result instantly from the global result cache.
To identify these queries, filter to queries that do not have a listed warehouse size:
select *
from snowflake.account_usage.query_history
where
warehouse_size is null
and start_time > current_date - 90
7. Queries with high Cloud Services usage
The metadata-only queries mentioned above are generally free because they don’t require a running virtual warehouse and Snowflake only charges for Cloud Services if it exceeds 10% of your daily compute usage.
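To check whether cloud services are actually adding to your bill, you can compare daily cloud services usage against the 10% adjustment in the metering_daily_history view; here’s a rough sketch:
select
usage_date,
sum(credits_used_cloud_services) as cloud_services_credits,
sum(credits_adjustment_cloud_services) as cloud_services_adjustment, -- negative; the 10% daily allowance
sum(credits_used_cloud_services) + sum(credits_adjustment_cloud_services) as billed_cloud_services_credits
from snowflake.account_usage.metering_daily_history
where usage_date > current_date - 30
group by 1
order by usage_date desc;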
If your Cloud Services usage exceeds 10%, you can identify queries with high cloud services usage using the following query:
select
query_parameterized_hash,
count(*) as num_executions,
any_value(query_text) as example_query_text,
sum(credits_used_cloud_services) as total_credits_used_cloud_services
from snowflake.account_usage.query_history
where
start_time > current_date - 7
group by 1
order by total_credits_used_cloud_services desc
limit 100
;
8. Queries with poor pruning
If the table you are querying is not well clustered or you don’t include a filter to limit the data being accessed, your query will end up scanning a lot of micro-partitions. Because scanning data is expensive, this can result in longer running and more expensive queries.
Here’s an example query to identify queries with poor pruning. For this example, we’ll define poor pruning as queries which scan more than 1GB of data and 80% of the micro-partitions.
select
query_text,
partitions_scanned,
partitions_total,
partitions_scanned/nullif(partitions_total, 0) as fraction_scanned, -- nullif avoids division by zero
bytes_scanned/power(1024,3) as bytes_scanned_gb, -- convert to gigabytes
from snowflake.account_usage.query_history
where
start_time > current_date - 7
and bytes_scanned/power(1024,3) > 1
and fraction_scanned > 0.8
limit 100
9. Queries spilling to disk & remote storage
When the virtual warehouse your query is running on runs out of memory, the query will begin spilling to the local disk. Once there is no space left on the local disk, it will start spilling to remote storage. This is both slow and expensive. If it’s not possible to re-write your query to remove the disk spillage (e.g. by processing less data), then it will likely be cheaper to run the query on a larger virtual warehouse.
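Sizing up is a one-line change; for example (MY_WH is a placeholder warehouse name):
alter warehouse my_wh set warehouse_size = 'LARGE';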
Here’s how you can identify queries spilling to disk and remote storage:
select
query_text,
partitions_scanned,
partitions_total,
bytes_scanned/power(1024,3) as bytes_scanned_gb, -- convert to gigabytes
bytes_spilled_to_local_storage/power(1024,3) as bytes_spilled_to_local_storage_gb,
bytes_spilled_to_remote_storage/power(1024,3) as bytes_spilled_to_remote_storage_gb,
bytes_spilled_to_local_storage_gb + bytes_spilled_to_remote_storage_gb as total_spillage_gb
from snowflake.account_usage.query_history
where
start_time > current_date - 7
and total_spillage_gb > 0
order by total_spillage_gb desc
limit 100
10. Queries with out-of-memory errors and retries
As discussed above, Snowflake will try to gracefully handle out-of-memory errors by having the query spill to disk. However, sometimes there are too many queries running concurrently, which causes the node to crash. When this happens, Snowflake will automatically retry the query for you.
Snowflake recently added new query retry columns to help you identify when this is happening.
select
query_text,
query_retry_time,
query_retry_cause,
fault_handling_time,
partitions_scanned,
partitions_total,
bytes_scanned/power(1024,3) as bytes_scanned_gb, -- convert to gigabytes
bytes_spilled_to_local_storage/power(1024,3) as bytes_spilled_to_local_storage_gb,
bytes_spilled_to_remote_storage/power(1024,3) as bytes_spilled_to_remote_storage_gb,
bytes_spilled_to_local_storage_gb + bytes_spilled_to_remote_storage_gb as total_spillage_gb
from snowflake.account_usage.query_history
where
start_time > current_date - 30
and query_retry_time > 0
order by query_retry_time desc
limit 100
The query_retry_cause column even includes some tips on how to resolve the issue!