Snowflake Access History: 8 ways to audit your account
- Ian Whitestone, Co-founder & CEO of SELECT
Access History is a view in the Snowflake database (under the account_usage schema) and one of the most useful datasets for auditing and understanding usage in your Snowflake account. In this post, I’ll dive into what data the Access History contains, and then share a number of different examples you can run in your account today.
What is in the Snowflake Access History?
The Access History contains one row per query executed in your account. For each query, it stores several columns describing the objects accessed and/or modified by the query.
To start with, there are three columns that are helpful for looking up the queries you are interested in:
- query_id: the unique identifier for the query
- user_name: the user who ran the query
- query_start_time: when the query started
If additional information about the query is required, like the role it was executed with or the warehouse it ran on, you can join the access history dataset with the Snowflake Query History view.
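Here is a minimal sketch of that join (the selected columns and the 7 day filter are just examples):
select
    access_history.query_id,
    access_history.user_name,
    query_history.role_name,
    query_history.warehouse_name,
    access_history.base_objects_accessed
from snowflake.account_usage.access_history
inner join snowflake.account_usage.query_history
    on access_history.query_id = query_history.query_id
where
    access_history.query_start_time > current_date - 7 -- adjust as needed
;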
In terms of what objects the query accessed, the view provides two columns:
- direct_objects_accessed: a JSON array of data objects the query directly accesses
- base_objects_accessed: a JSON array of data objects that a query either directly or indirectly accesses (i.e. the underlying tables that populate a view)
For objects a query modified, there are two columns:
- objects_modified: a JSON array that specifies the objects modified by a query. This will be populated for INSERT, UPDATE, MERGE, CREATE or similar types of queries that update/insert/delete records in a table
- objects_modified_by_ddl: contains information about the DDL operation performed on a database, schema, table, view and/or column (see the example after this list)
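None of the examples later in this post use objects_modified_by_ddl, so here is a minimal sketch of flattening it to list recent DDL activity. The objectName, objectDomain and operationType keys follow the same pattern as the other columns; adjust the field names if your results differ:
select
    access_history.query_id,
    access_history.query_start_time,
    access_history.user_name,
    ddl_objects.value:objectName::text as object_name,
    ddl_objects.value:objectDomain::text as object_domain,
    ddl_objects.value:operationType::text as operation_type -- e.g. CREATE, ALTER, DROP
from snowflake.account_usage.access_history, lateral flatten(access_history.objects_modified_by_ddl) as ddl_objects
where
    access_history.query_start_time > current_date - 30 -- adjust as needed
;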
Direct vs. Base Objects Accessed
To understand the difference between direct and base objects accessed, consider the following query, which accesses two columns from a view named user_sales_summary:
select
user_name,
total_sales
from user_sales_summary
The direct_objects_accessed column would include one entry for the direct access of the user_sales_summary view, while the base_objects_accessed column would contain two entries for the two underlying tables (users and sales) that power the view.
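You can see this for yourself by pulling both columns for a single query. A quick sketch, where the query_id is a placeholder to swap for one of your own:
select
    query_id,
    direct_objects_accessed,
    base_objects_accessed
from snowflake.account_usage.access_history
where
    query_id = '<query_id>' -- placeholder: substitute a real query id
;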
Access History Data Retention
Similar to other account usage views like the Snowflake Query History, Access History retains data from the last 365 days.
Is Access History available to all Snowflake customers?
The Access History view is only available to Snowflake customers on Enterprise edition or higher.
Now that we’ve covered the basics, let’s get into some real examples you can run in your account to answer a variety of common questions.
1. Find all tables a given user accessed in the last 30 days
The query below shows how to find all tables accessed by a given user in the last 30 days. Because the base_objects_accessed column is an array, we must use a lateral join combined with the flatten table function to explode each entry in the array into a separate row. You’ll see this pattern used throughout the blog post.
with
-- This will output 1 row per table accessed in a query
access_history_flattened as (
select
access_history.query_id,
access_history.query_start_time,
access_history.user_name,
objects_accessed.value:objectName::text as object_name,
objects_accessed.value:objectDomain::text as object_domain,
objects_accessed.value:columns as columns_array
from snowflake.account_usage.access_history, lateral flatten(access_history.base_objects_accessed) as objects_accessed
where
access_history.query_start_time > current_date - 30
)
select
object_name,
count(*) as number_of_times_accessed
from access_history_flattened
where
user_name='IAN'
and object_domain='Table'
group by 1
order by 2 desc
;
Note how I filter on object_domain='Table'. You can modify this filter as required to answer related questions like the two below (a sample variation follows the list):
- What views did a user access?
- What functions did they use?
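For example, reusing the access_history_flattened CTE from the query above, a small change to the filter answers the first question (the 'View' domain value is what I'd expect here; check the distinct object_domain values in your account if results look off):
select
    object_name,
    count(*) as number_of_times_accessed
from access_history_flattened
where
    user_name='IAN'
    and object_domain='View' -- or 'Function' for the second question
group by 1
order by 2 desc
;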
2. Find all tables accessed in a schema
To find all tables accessed in a particular schema, we can leverage the access_history_flattened CTE from above. The object_name present in Access History will always be the fully qualified name, meaning it will be in the format database_name.schema_name.table_name. As a result, we parse this object name to get the database and schema names, and then filter as needed:
with
-- This will output 1 row per table accessed in a query
access_history_flattened as (
select
access_history.query_id,
access_history.query_start_time,
access_history.user_name,
objects_accessed.value:objectName::text as object_name,
objects_accessed.value:objectDomain::text as object_domain,
objects_accessed.value:columns as columns_array
from snowflake.account_usage.access_history, lateral flatten(access_history.base_objects_accessed) as objects_accessed
),
access_history_flattened_w_names as (
select
*,
split(object_name, '.')[0]::string as database_name,
split(object_name, '.')[1]::string as schema_name,
split(object_name, '.')[2]::string as table_name
from access_history_flattened
)
select
table_name,
count(*) as number_of_times_accessed
from access_history_flattened_w_names
where
object_domain='Table'
and database_name='ANALYTICS'
and schema_name='PUBLIC'
group by 1
order by 2 desc
;
3. Return all users who accessed a specific table in the last 30 days
Imagine you are trying to identify users who may have accessed sensitive data in a table. You can leverage the Access History view to quickly identify the full list of users:
with
-- This will output 1 row per table accessed in a query
access_history_flattened as (
select
access_history.query_id,
access_history.query_start_time,
access_history.user_name,
objects_accessed.value:objectName::text as object_name,
objects_accessed.value:objectDomain::text as object_domain,
objects_accessed.value:columns as columns_array
from snowflake.account_usage.access_history, lateral flatten(access_history.base_objects_accessed) as objects_accessed
where
access_history.query_start_time > current_date - 30 -- adjust as needed
),
access_history_flattened_w_names as (
select
*,
split(object_name, '.')[0]::string as database_name,
split(object_name, '.')[1]::string as schema_name,
split(object_name, '.')[2]::string as table_name
from access_history_flattened
)
select
user_name,
count(*) as number_of_times_accessed
from access_history_flattened_w_names
where
object_domain='Table'
and database_name='ANALYTICS'
and schema_name='PUBLIC'
and table_name='TABLE_WITH_SENSITIVE_DATA'
group by 1
order by 2 desc
;
4. Identify Unused Tables
I previously wrote about how to identify unused tables by leveraging the access history view. You can refer to that post for a detailed explanation. Here is the code you can use:
with
access_history_flattened as (
select
access_history.query_id,
access_history.query_start_time,
access_history.user_name,
objects_accessed.value:objectId::integer as table_id,
objects_accessed.value:objectName::text as object_name,
objects_accessed.value:objectDomain::text as object_domain,
objects_accessed.value:columns as columns_array
from snowflake.account_usage.access_history, lateral flatten(access_history.base_objects_accessed) as objects_accessed
),
table_access_history as (
select
query_id,
query_start_time,
user_name,
object_name as fully_qualified_table_name,
table_id
from access_history_flattened
where
object_domain = 'Table'
),
table_access_summary as (
select
table_id,
max(query_start_time) as last_accessed_at,
max_by(user_name, query_start_time) as last_accessed_by,
max_by(query_id, query_start_time) as last_query_id
from table_access_history
group by 1
),
table_storage_metrics as (
select
id as table_id,
table_catalog || '.' || table_schema || '.' || table_name as fully_qualified_table_name,
(active_bytes + time_travel_bytes + failsafe_bytes + retained_for_clone_bytes)/power(1024,4) as total_storage_tb
from snowflake.account_usage.table_storage_metrics
where
not deleted
)
select
table_storage_metrics.*,
table_access_summary.* exclude (table_id)
from table_storage_metrics
left join table_access_summary
on table_storage_metrics.table_id=table_access_summary.table_id
where
coalesce(last_accessed_at, date'1900-01-01') < (current_date - 30) -- Modify as needed
order by table_storage_metrics.total_storage_tb desc
5. Identify Unused Views
We can easily modify the query from above to identify views that have not been used in the last 30 days. All we need to do is change object_domain = 'Table' to object_domain = 'View':
with
access_history_flattened as (
select
access_history.query_id,
access_history.query_start_time,
access_history.user_name,
objects_accessed.value:objectId::integer as table_id,
objects_accessed.value:objectName::text as object_name,
objects_accessed.value:objectDomain::text as object_domain,
objects_accessed.value:columns as columns_array
from snowflake.account_usage.access_history, lateral flatten(access_history.base_objects_accessed) as objects_accessed
),
table_access_history as (
select
query_id,
query_start_time,
user_name,
object_name as fully_qualified_table_name,
table_id
from access_history_flattened
where
object_domain = 'View'
),
table_access_summary as (
select
table_id,
max(query_start_time) as last_accessed_at,
max_by(user_name, query_start_time) as last_accessed_by,
max_by(query_id, query_start_time) as last_query_id
from table_access_history
group by 1
),
table_storage_metrics as (
select
id as table_id,
table_catalog || '.' || table_schema || '.' || table_name as fully_qualified_table_name,
(active_bytes + time_travel_bytes + failsafe_bytes + retained_for_clone_bytes)/power(1024,4) as total_storage_tb
from snowflake.account_usage.table_storage_metrics
where
not deleted
)
select
table_storage_metrics.*,
table_access_summary.* exclude (table_id)
from table_storage_metrics
left join table_access_summary
on table_storage_metrics.table_id=table_access_summary.table_id
where
coalesce(last_accessed_at, date'1900-01-01') < (current_date - 30) -- Modify as needed
order by table_storage_metrics.total_storage_tb desc
6. Identify the most common columns accessed in a given table
The examples thus far have solely addressed questions about table/schema access. We can go a layer deeper to analyze column usage for a given table by leveraging the columns array present in the base_objects_accessed and direct_objects_accessed fields. After performing an extra lateral flatten, we get a dataset with 1 row per column accessed in a query (see the access_history_flattened_columns CTE).
with
-- This will output 1 row per table accessed in a query
access_history_flattened as (
select
access_history.query_id,
access_history.query_start_time,
access_history.user_name,
objects_accessed.value:objectName::text as object_name,
objects_accessed.value:objectDomain::text as object_domain,
objects_accessed.value:columns as columns_array
from snowflake.account_usage.access_history, lateral flatten(access_history.base_objects_accessed) as objects_accessed
where
access_history.query_start_time > current_date - 30 -- adjust as needed
),
-- 1 row per column accessed in a query
access_history_flattened_columns as (
select
access_history_flattened.* exclude columns_array,
columns_accessed.value:columnName::text as column_name
from access_history_flattened, lateral flatten(access_history_flattened.columns_array) as columns_accessed
)
select
column_name,
count(*) as number_of_times_accessed
from access_history_flattened_columns
where
object_domain='Table'
and object_name='ANALYTICS.PUBLIC.SALES'
group by 1
order by 2 desc
7. Return all users who accessed a specific column in the last 30 days
Working off the example from above, we can easily identify users who have accessed a specific column:
with
-- This will output 1 row per table accessed in a query
access_history_flattened as (
select
access_history.query_id,
access_history.query_start_time,
access_history.user_name,
objects_accessed.value:objectName::text as object_name,
objects_accessed.value:objectDomain::text as object_domain,
objects_accessed.value:columns as columns_array
from snowflake.account_usage.access_history, lateral flatten(access_history.base_objects_accessed) as objects_accessed
where
access_history.query_start_time > current_date - 30 -- adjust as needed
),
-- 1 row per column accessed in a query
access_history_flattened_columns as (
select
access_history_flattened.* exclude columns_array,
columns_accessed.value:columnName::text as column_name
from access_history_flattened, lateral flatten(access_history_flattened.columns_array) as columns_accessed
)
select
user_name,
count(*) as number_of_times_accessed
from access_history_flattened_columns
where
object_domain='Table'
and object_name='ANALYTICS.PUBLIC.SALES'
and column_name='AMOUNT' -- column names appear as stored in the catalog, typically uppercase
group by 1
order by 2 desc
8. Identify all queries that have modified a table
When investigating why or how a given table changes, it can be helpful to quickly identify the queries or users that modified the object. Or perhaps you want to see how often a table is being updated (an aggregation example follows the query below). Using similar approaches to the ones above, we can identify all queries that have modified a table by flattening the objects_modified column:
with
-- This will output 1 row per table modified in a query
access_history_flattened as (
select
access_history.query_id,
access_history.query_start_time,
access_history.user_name,
objects_modified.value:objectName::text as object_name,
objects_modified.value:objectDomain::text as object_domain
from snowflake.account_usage.access_history, lateral flatten(access_history.objects_modified) as objects_modified
where
access_history.query_start_time > current_date - 30 -- adjust as needed
)
select
query_id,
user_name,
query_start_time
from access_history_flattened
where
object_domain='Table'
and object_name='ANALYTICS.PUBLIC.SALES'
;
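And if the question is really about how often the table is being updated, you can keep the same access_history_flattened CTE and swap the final select for an aggregation; a rough sketch:
select
    query_start_time::date as modification_date,
    count(*) as number_of_modifying_queries
from access_history_flattened
where
    object_domain='Table'
    and object_name='ANALYTICS.PUBLIC.SALES'
group by 1
order by 1 desc
;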