
Identifying unused tables in Snowflake

Ian Whitestone
Co-founder & CEO of SELECT

While Snowflake storage costs tend to be a small portion of overall Snowflake spend, many customers have a significant number of unused tables in their accounts incurring unnecessary charges. If a dataset isn’t being used, isn’t adding value to the business, and isn’t required by law to be retained, it should be removed.

Removing unused datasets can be a quick win for teams looking to reduce their Snowflake spend. It can also improve security and reduce risks associated with data breaches and data exposure. The less data you store, the smaller the footprint for unintended access.

Lastly, deleting unused tables can improve overall data warehouse usability. Unused datasets often contain data that is stale or not meant to be accessed, so removing these tables can help avoid confusion or reporting errors.

In this post we’ll cover how to identify unused tables in Snowflake using the access_history account usage view.

Skip to the final SQL?

If you want to skip ahead and see the final SQL implementation, you can head straight to the end!

Snowflake Access History View

Access History is a view in the Account Usage schema of the Snowflake Database. It is available for all Snowflake accounts on Enterprise Edition or higher. Access History can be used to look up the Snowflake objects (i.e. tables, views, and columns) accessed by each query, either directly or indirectly.
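Each row in the view describes a single query. As a quick orientation, here is a minimal sketch of pulling the columns referenced throughout this post; the 7-day filter and 100-row limit are arbitrary example values, and the column names follow the documented view schema:

-- Minimal look at the raw view; adjust the time window as needed
select
    query_id,
    query_start_time,
    user_name,
    direct_objects_accessed,
    base_objects_accessed
from snowflake.account_usage.access_history
where query_start_time > current_date - 7
limit 100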

Direct versus Base Objects Accessed

To determine which objects were accessed by a query, there are two columns of interest: direct_objects_accessed and base_objects_accessed. A key difference between the two columns comes from how they handle views. Consider the following view definition:

create or replace view orders_view as (
    select *
    from orders
    where
        not test
        and success
);

The query select * from orders_view directly accesses the orders_view object, and indirectly accesses the base orders table. Correspondingly, orders_view will appear in the direct_objects_accessed column of access_history, whereas orders will appear in base_objects_accessed.
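To make the difference concrete, here is roughly what each column would contain for that query. The object IDs and the ANALYTICS.PROD database/schema qualifiers are hypothetical, and the field layout mirrors the documented example shown below (the columns field is omitted here for brevity). direct_objects_accessed records the view that was referenced directly:

[
    {
        "objectDomain": "View",
        "objectId": 1001,
        "objectName": "ANALYTICS.PROD.ORDERS_VIEW"
    }
]

while base_objects_accessed records the underlying table the view reads from:

[
    {
        "objectDomain": "Table",
        "objectId": 1002,
        "objectName": "ANALYTICS.PROD.ORDERS"
    }
]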

When it comes to deciding if a table is unused, it’s important to use base_objects_accessed since this will account for queries that indirectly access a table through a view.

Parsing base_objects_accessed

base_objects_accessed is a JSON array of all base data objects accessed during query execution. Here’s an example of the column’s contents from the documentation:

[
    {
        "columns": [
            {
                "columnId": 68610,
                "columnName": "CONTENT"
            }
        ],
        "objectDomain": "Table",
        "objectId": 66564,
        "objectName": "GOVERNANCE.TABLES.T1"
    }
]

The array of objects accessed by each query can be transformed to one row per object via lateral flatten and then filtered to only consider table objects, as shown below:

with
access_history as (
    select *
    from snowflake.account_usage.access_history
),
access_history_flattened as (
    select
        access_history.query_id,
        access_history.query_start_time,
        access_history.user_name,
        objects_accessed.value:objectId::integer as table_id,
        objects_accessed.value:objectName::text as object_name,
        objects_accessed.value:objectDomain::text as object_domain,
        objects_accessed.value:columns as columns_array
    from access_history, lateral flatten(access_history.base_objects_accessed) as objects_accessed
),
table_access_history as (
    select
        query_id,
        query_start_time,
        user_name,
        object_name as fully_qualified_table_name
    from access_history_flattened
    where
        object_domain = 'Table' -- removes secured views
        and table_id is not null -- removes tables from a data share
)
select *
from table_access_history

Find when a table was last queried/accessed

Using the “flattened” access_history from the query above, we can determine the exact time a table was last accessed along with the user who ran the query:

with
access_history as (
    select *
    from snowflake.account_usage.access_history
),
access_history_flattened as (
    select
        access_history.query_id,
        access_history.query_start_time,
        access_history.user_name,
        objects_accessed.value:objectId::integer as table_id,
        objects_accessed.value:objectName::text as object_name,
        objects_accessed.value:objectDomain::text as object_domain,
        objects_accessed.value:columns as columns_array
    from access_history, lateral flatten(access_history.base_objects_accessed) as objects_accessed
),
table_access_history as (
    select
        query_id,
        query_start_time,
        user_name,
        object_name as fully_qualified_table_name
    from access_history_flattened
    where
        object_domain = 'Table' -- removes secured views
        and table_id is not null -- removes tables from a data share
)
select
    fully_qualified_table_name,
    max(query_start_time) as last_accessed_at,
    max_by(user_name, query_start_time) as last_accessed_by,
    max_by(query_id, query_start_time) as last_query_id
from table_access_history
group by 1

Calculate table storage costs

When identifying unused tables to delete, it’s helpful to see the associated storage costs. Using the table_storage_metrics account usage view and an assumed storage rate of $23 per terabyte per month, the annual storage cost of each table can be calculated:

select
    id as table_id,
    table_catalog || '.' || table_schema || '.' || table_name as fully_qualified_table_name,
    (active_bytes + time_travel_bytes + failsafe_bytes + retained_for_clone_bytes) / power(1024, 4) as total_storage_tb,
    -- Assumes a storage rate of $23/TB/month
    -- Update to the appropriate value based on your Snowflake contract
    total_storage_tb * 12 * 23 as annualized_storage_cost
from snowflake.account_usage.table_storage_metrics
where
    not deleted
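
As a quick sanity check on the formula: a table with 0.5 TB of total storage (active, Time Travel, fail-safe, and clone-retained bytes combined) would come out to roughly 0.5 × $23 × 12 ≈ $138 per year at this assumed rate.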

Identify all tables not queried in the last X days

So far we’ve covered how to determine when a table was last accessed and the storage costs associated with each table. We can tie these building blocks together to identify all tables not queried in the last 30 days (adjust the window to suit your needs) and show the annual savings that could be expected if the tables were deleted.

Standard Edition not supported!

The SQL below relies on the account_usage.access_history view which is only available for Snowflake customers on Enterprise Edition and higher.

Using dbt?

If you are using dbt, consider the alternative version of this SQL below, which runs much faster.

with
access_history as (
    select *
    from snowflake.account_usage.access_history
),
access_history_flattened as (
    select
        access_history.query_id,
        access_history.query_start_time,
        access_history.user_name,
        objects_accessed.value:objectId::integer as table_id,
        objects_accessed.value:objectName::text as object_name,
        objects_accessed.value:objectDomain::text as object_domain,
        objects_accessed.value:columns as columns_array
    from access_history, lateral flatten(access_history.base_objects_accessed) as objects_accessed
),
table_access_history as (
    select
        query_id,
        query_start_time,
        user_name,
        object_name as fully_qualified_table_name,
        table_id
    from access_history_flattened
    where
        object_domain = 'Table' -- removes secured views
        and table_id is not null -- removes tables from a data share
),
table_access_summary as (
    select
        table_id,
        max(query_start_time) as last_accessed_at,
        max_by(user_name, query_start_time) as last_accessed_by,
        max_by(query_id, query_start_time) as last_query_id
    from table_access_history
    group by 1
),
table_storage_metrics as (
    select
        id as table_id,
        table_catalog || '.' || table_schema || '.' || table_name as fully_qualified_table_name,
        (active_bytes + time_travel_bytes + failsafe_bytes + retained_for_clone_bytes) / power(1024, 4) as total_storage_tb,
        -- Assumes a storage rate of $23/TB/month
        -- Update to the appropriate value based on your Snowflake contract
        total_storage_tb * 12 * 23 as annualized_storage_cost
    from snowflake.account_usage.table_storage_metrics
    where
        not deleted
)
select
    table_storage_metrics.*,
    table_access_summary.* exclude (table_id)
from table_storage_metrics
inner join table_access_summary
    on table_storage_metrics.table_id = table_access_summary.table_id
where
    last_accessed_at < (current_date - 30) -- Modify as needed
order by table_storage_metrics.annualized_storage_cost desc

Identifying unused tables with dbt

Querying and flattening the access_history view can be very slow due to the volume of data that must be processed. To speed up queries against table access history, we recommend incrementally materializing this data using our open-source dbt package: dbt_snowflake_monitoring. Once the package is installed, queries to identify unused tables become much simpler. The code from above can be rewritten as:

with
table_access_summary as (
    select
        table_id,
        max(query_start_time) as last_accessed_at,
        max_by(user_name, query_start_time) as last_accessed_by,
        max_by(query_id, query_start_time) as last_query_id
    from query_base_table_access
    group by 1
),
table_storage_metrics as (
    select
        id as table_id,
        table_catalog || '.' || table_schema || '.' || table_name as fully_qualified_table_name,
        (active_bytes + time_travel_bytes + failsafe_bytes + retained_for_clone_bytes) / power(1024, 4) as total_storage_tb,
        -- Assumes a storage rate of $23/TB/month
        -- Update to the appropriate value based on your Snowflake contract
        total_storage_tb * 12 * 23 as annualized_storage_cost
    from snowflake.account_usage.table_storage_metrics
    where
        not deleted
)
select
    table_storage_metrics.*,
    table_access_summary.* exclude (table_id)
from table_storage_metrics
inner join table_access_summary
    on table_storage_metrics.table_id = table_access_summary.table_id
where
    last_accessed_at < (current_date - 30) -- Modify as needed
order by table_storage_metrics.annualized_storage_cost desc

Find when a table was last updated

As part of deciding whether to delete a table, it can be helpful to know when the table was last updated by a DDL or DML operation. The query below shows how to find all tables that were updated in the past week by using the tables account usage view:

select
    table_id,
    table_catalog || '.' || table_schema || '.' || table_name as fully_qualified_table_name,
    last_altered as last_altered_at
from snowflake.account_usage.tables
where
    last_altered > current_date - 7
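
Once a table has been confirmed as unused, with no recent access and no recent updates, removing it is a single statement. A cautious pattern, sketched below with a hypothetical table name, is to rename the table first as a soft delete and drop it after a grace period; a dropped table can still be restored with undrop while it remains within its Time Travel retention window.

-- Hypothetical table name; rename first as a low-risk "soft delete"
alter table analytics.prod.stale_events rename to analytics.prod.stale_events_to_drop;

-- After a grace period with no complaints, drop the table
drop table analytics.prod.stale_events_to_drop;

-- If it turns out to be needed, restore it within the Time Travel retention period
undrop table analytics.prod.stale_events_to_drop;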

Wrapping Up

Removing unused tables represents one of the many cost saving opportunities available to Snowflake users. In addition to surfacing table access patterns, SELECT automatically produces a variety of other optimization recommendations. Get access today or book a demo using the links below.

Ian Whitestone
Co-founder & CEO of SELECT
Ian is the Co-founder & CEO of SELECT, a software product which helps users automatically optimize, understand and monitor Snowflake usage. Prior to starting SELECT, Ian spent 6 years leading full stack data science & engineering teams at Shopify and Capital One. At Shopify, Ian led the efforts to optimize their data warehouse and increase cost observability.
