
Analyzing your DAG to identify unused dbt models in Snowflake

  • Jay Sobel
    Analytics Engineer at Ramp

Why care about unused dbt models?

One of the easiest ways to reduce unnecessary Snowflake spend is to get rid of things that aren't being used. In a previous post on identifying unused tables in Snowflake, Ian explained how Snowflake's Account Usage views can be used to introspect Snowflake object usage and ultimately identify and remove tables that aren't being actively queried, allowing users to save on storage costs. For tables that are created and continuously updated by ELT tools like dbt, the potential savings are much higher: on top of the storage costs, you also save the compute costs of building and updating the table.

If your dbt project has been around for over a year, it's quite likely that you'll have a number of dbt models that are no longer being used, but are still running every day and driving compute costs. If you're looking for a quick win to both lower your costs and improve the cleanliness of your data warehouse, this post is for you!

Understanding dbt model usage

In this article I'll expand on the premise of understanding Snowflake object usage to specifically capture dbt model usage. This requires one additional model representing the relationships between dbt models (the DAG, as a table), so that intermediate models with zero direct usage aren't flagged as unused as long as their downstream models have some query activity. I recommend starting with the original post, at least to get familiar with the account_usage schema.

To understand why we can’t use the approach from that blog post to identify unused dbt models, consider the following DAG:

Example dbt model DAG in Snowflake

If we were to query for unused tables, we would initially identify every table as having some usage, but that usage would be from dbt itself running tests or building downstream models. Once we exclude queries run by dbt, we might correctly identify the top row (stg_fulfillments, fct_fulfillments and fulfillments_rollup) as unused models, but our output would also claim the entire stg_ layer is unused. In dbt, direct usage is not the only concern. We also need to consider the usage of downstream dependents.

We can do this by building a model that captures dbt model descendants, and then do some clever aggregation of queries ‘upwards’ over these DAG dependencies.

An overview of the approach

Let's consider an even simpler DAG with just 4 models. In order to properly identify unused dbt models, we first need to understand which models rely on each other.

Simple dbt model DAG in Snowflake

For each model, we need to list all the downstream models. Here's how this simple DAG will look in the new dependencies model we'll create. The green rows are a node paired with itself, the orange rows are direct parent-child relationships, and the purple row shows that a direct parent can also be an indirect one (connected through a longer path as well).

Modelling the dbt DAG in Snowflake

Once we have this model, we can do things like determine whether the Alice model can be safely removed by checking the usage on the downstream dependencies: Bob, Chad, and Delta.
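
To make that concrete, here's a hypothetical sketch. The model names and DAG shape are illustrative only (assuming Alice feeds Bob and Chad, and both feed Delta), and the dependencies model referenced is the dbt_model_descendants table we build later in this post. Checking whether Alice is safe to remove means checking usage across every row it returns:

-- Illustrative only: list everything whose usage counts toward alice
select descendant
from dbt_model_descendants
where node = 'alice';
-- hypothetical result: alice (itself, depth 0), bob, chad, delta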

Prerequisites

To determine which tables are being used, we’ll leverage the models discussed in the previous article. Both of these are available in the dbt-snowflake-monitoring package built & maintained by SELECT.

  • dbt_snowflake_monitoring/models/query_base_object_access.sql
  • dbt_snowflake_monitoring/models/query_history_enriched.sql

For our model of dbt dependents, we'll be building something new: dbt_model_descendants. This can be derived from dbt-snowflake-monitoring or, more accurately, via dbt_artifacts if you have it set up. I'll provide SQL for both sources:

  • Option 1: dbt_snowflake_monitoring/dbt_queries.sql
  • Option 2: dbt_artifacts/dim_dbt__current_models.sql

How to model dependencies in your dbt DAG

Step 1: get each model's parents

Our first step is to derive a table with one row per dbt model and an array column capturing the model's direct parents.

node              | table_sk                         | parent_array
------------------|----------------------------------|-------------------------
customer_activity | prod.analytics.customer_activity | ["customers", "events"]
events            | prod.analytics.events            | ["stg_events"]

To build this dataset, there are two options.

Using dbt_snowflake_monitoring

The first option is to use dbt_snowflake_monitoring/dbt_queries.sql, which you should already have installed for the other required models (query_base_object_access, query_history_enriched). The two main drawbacks of this option are that deleted models will continue to be included for a couple of days after they leave the project, and that sources are never included, because they are not "refs".

select
    dbt_node_name as node,
    lower(concat(dbt_target_database, '.', dbt_target_schema, '.', dbt_node_alias)) as table_sk,
    dbt_node_refs as parent_array
from dbt_queries
where
    start_time > current_date - 3 -- tunes risk of deleted model inclusion
    and dbt_node_resource_type in ('model', 'snapshot', 'seed')
    and execution_status = 'SUCCESS'
    -- [optional] add additional filters if you want to exclude certain environments or projects
    -- and dbt_node_package_name = <my project>
    -- and dbt_target_name = <my target>
    -- and dbt_target_database = <my prod db>
    -- and dbt_target_schema in <my prod schemas>
qualify row_number() over (partition by dbt_node_name order by end_time desc) = 1

Using dbt_artifacts

The second option is to use dbt_artifacts/dim_dbt__current_models. This is the more robust option, but it requires the dbt_artifacts package, which has a more involved set-up process.

select
    split_part(node_id, '.', 3) as node,
    lower(concat(database, '.', schema, '.', name)) as table_sk,
    depends_on_nodes as parent_array
from dim_dbt__current_models
where true
    -- [optional] filter to specific databases
    -- and database in (<your databases>)
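
As a side note on the split_part calls above: dbt node IDs follow the pattern resource_type.package_name.node_name, so taking the third dot-separated part isolates the node name. A quick illustration with a made-up node ID:

select split_part('model.my_project.customer_activity', '.', 3) as node;
-- returns: customer_activity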

Step 2: Derive node children

Now that we have a list of nodes, we'll create a new CTE, node_children, by flattening the nodes CTE. This maps out the first-degree parent-child relationships.

Using dbt_snowflake_monitoring

with
nodes as (
    select
        dbt_node_name as node,
        dbt_node_refs as parent_array,
        lower(concat(dbt_target_database, '.', dbt_target_schema, '.', dbt_node_alias)) as table_sk,
        query_id
    from dbt_queries
    where true
        and start_time > current_date - 3 -- tunes risk of deleted model inclusion
        and dbt_node_resource_type in ('model', 'snapshot', 'seed')
        and execution_status = 'SUCCESS'
    qualify row_number() over (partition by dbt_node_name order by end_time desc) = 1
),
-- Unpack the parents (refs) array and swap the relationship into node -> descendant terms.
node_children as (
    select
        parents.value::text as node,
        nodes.node as descendant
    from nodes, lateral flatten(input => parent_array) as parents
)

Using dbt_artifacts

with
nodes as (
    select
        -- assume packaged model names do not collide
        split_part(node_id, '.', 3) as node,
        lower(concat(database, '.', schema, '.', name)) as table_sk,
        depends_on_nodes as parent_array
    from dim_dbt__current_models
),
-- Unpack the parents array and swap the relationship into node -> descendant terms.
node_children as (
    select
        split_part(parents.value::text, '.', 3) as node,
        nodes.node as descendant
    from nodes,
        lateral flatten(input => parent_array) as parents
    -- Parents can be models, seeds, sources, metrics or snapshots; keep only models, snapshots and seeds here.
    where
        split_part(parents.value::text, '.', 1) in ('model', 'snapshot', 'seed')
),
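
Before moving on, it can help to picture the output. For the hypothetical Alice -> {Bob, Chad} -> Delta DAG from earlier, node_children would hold only the direct edges; the recursion in step 3 fills in deeper relationships like alice -> delta:

-- Hypothetical contents of node_children for Alice -> {Bob, Chad} -> Delta
select * from (
    values
        ('alice', 'bob'),
        ('alice', 'chad'),
        ('bob', 'delta'),
        ('chad', 'delta')
) as node_children (node, descendant);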

Step 3: Recursively find all model descendants

The rest of the query is the same regardless of whether you're using dbt-snowflake-monitoring or dbt_artifacts. It carries out the following steps:

  • Derive node_descendants_recursive (all degrees) by recursively joining node_children (from above) to itself
  • The granularity at this point is "all paths"
  • Union an additional row for "a node and itself"
  • Aggregate node_descendants to unique node-descendant pairs

Here's the query assuming you are using dbt-snowflake-monitoring:

with
nodes as (
    select
        dbt_node_name as node,
        lower(concat(dbt_target_database, '.', dbt_target_schema, '.', dbt_node_alias)) as table_sk,
        dbt_node_refs as parent_array
    from dbt_queries
    where
        start_time > current_date - 3 -- tunes risk of deleted model inclusion
        and dbt_node_resource_type in ('model', 'snapshot', 'seed')
        and execution_status = 'SUCCESS'
        -- [optional] add additional filters if you want to exclude certain environments or projects
        -- and dbt_node_package_name = <my project>
        -- and dbt_target_name = <my target>
        -- and dbt_target_database = <my prod db>
        -- and dbt_target_schema in <my prod schemas>
    qualify row_number() over (partition by dbt_node_name order by end_time desc) = 1
),
-- Unpack the parents (refs) array and swap the relationship into node -> descendant terms.
node_children as (
    select
        parents.value::text as node,
        nodes.node as descendant
    from nodes, lateral flatten(input => parent_array) as parents
),
-- Traverse the children of children etc., building up node-descendant pairs and path arrays
-- (creates one row per *path* between nodes).
node_descendants_recursive (node, descendant, depth, path_array) as (
    select
        node,
        descendant,
        1 as depth,
        array_construct(node, descendant) as path_array
    from node_children

    union all

    select
        node_descendants_recursive.node,
        node_children.descendant,
        node_descendants_recursive.depth + 1 as depth,
        array_cat(node_descendants_recursive.path_array, to_array(node_children.descendant)) as path_array
    from node_descendants_recursive
    inner join node_children
        on node_descendants_recursive.descendant = node_children.node
    where
        node_descendants_recursive.depth < 50 -- recursion hard-stop
),
-- Add every node as its own descendant with depth 0
node_descendants_and_self as (
    select
        node,
        descendant,
        depth,
        path_array
    from node_descendants_recursive

    union all

    select
        node,
        node as descendant,
        0 as depth,
        to_array(node) as path_array
    from nodes
),
-- De-dupe paths between nodes
node_descendants as (
    select
        node,
        descendant,
        count(*) as count_paths,
        min(depth) as min_depth,
        -- Path arrays for fun. Depth ties are common and non-deterministic here.
        min_by(path_array, depth) as shortest_path_array
    from node_descendants_and_self
    group by 1, 2
),
final as (
    select
        node_descendants.node,
        node_descendants.descendant,
        concat(node_descendants.node, '-', node_descendants.descendant) as node_descendant_sk,
        node.table_sk,
        descendant.table_sk as descendant_table_sk
    from node_descendants
    inner join nodes as node
        on node_descendants.node = node.node
    inner join nodes as descendant
        on node_descendants.descendant = descendant.node
)
select *
from final

Head to the appendix for a version of this query you can leverage in your dbt project.

How to query for unused dbt models

With our new dbt_model_descendants model accounting for model dependencies (or should we say descendencies?), we can aggregate direct table usage and distribute it upward through the DAG. This looks like a join of query counts on the descendant side, conditionally aggregated around the parent. The self-edge comes into play here: the conditional aggregation can differentiate direct from indirect usage by checking whether the descendant is the node itself.

with
table_queries as (
    select
        lower(query_base_object_access.object_name) as table_sk,
        count(*) as count_queries
    from query_history_enriched
    inner join query_base_object_access
        on query_history_enriched.query_id = query_base_object_access.query_id
        and query_history_enriched.start_time = query_base_object_access.query_start_time
    where
        query_history_enriched.start_time > current_date - 180
        and query_history_enriched.query_type = 'SELECT'
        and query_history_enriched.execution_status = 'SUCCESS'
        -- exclude dbt queries
        and dbt_metadata is null
        -- optional: remove usage from tooling that automatically runs queries
        -- and user_name not in (<TOOLING USERS>)
    group by 1
)
select
    md.table_sk,
    -- coalesce so models with no matching queries report 0 rather than null
    sum(iff(md.node = md.descendant, coalesce(tq.count_queries, 0), 0)) as direct_queries,
    sum(iff(md.node <> md.descendant, coalesce(tq.count_queries, 0), 0)) as downstream_queries,
    direct_queries + downstream_queries as total_queries
from dbt_model_descendants as md
left join table_queries as tq
    on md.descendant_table_sk = tq.table_sk
group by 1
order by total_queries asc

This query tells us how many "usage" queries directly hit each dbt model, as well as how many usage queries land on a model's downstream descendants. If a model has total_queries = 0 then it's both not serving direct usage and not supporting any downstream direct usage. Do note that downstream_queries and total_queries will sum to more than your actual Snowflake query count, since a single query can be counted against more than one model (once for the model it selects from, and once for each of that model's ancestors).

What to do with unused models

As an analytics engineer, I know more about building new tables than I do about deleting old ones.

Most dbt models are fixed transformations of raw data that can be switched off and on without "missing" anything. Sure, the production table will go stale until the model is turned back on, but there will be no irretrievable loss of information. In these cases, simply disabling the model, or deleting it and letting it live on in your git history, are both good options. I'd recommend also dropping the table at this stage, just to avoid any users accessing stale data.

Models like dbt snapshots, or other fancy incremental schemes, might not fit this description. Deprecating something in this category will require more case-specific consideration, but having been in this position, I know it's also likely that nobody remembers what the model is for, or even what its intentions were.

How to remove a dbt model from your project

Steps for deleting a model:

  1. Delete the .sql model file.
  2. Ctrl+Shift+F search the model’s name in the whole project to find…
    • ref() calls to the model
    • schema or config .yml references to the model.
  3. Drop the corresponding Snowflake table (or view); see the sketch below.

You probably won’t have to update a ref() if you’re using this approach because any model referencing an unused model must also be unused (or else the parent would have downstream usage!). If there is a chain of unused models, I’d recommend starting at the end and working backwards; in A -> B -> C, delete C first!
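
For step 3, the drop itself is a one-liner. A minimal sketch, using a hypothetical database, schema and model name (swap in drop view if the model is materialized as a view):

drop table if exists prod.analytics.my_unused_model;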

Disabling a model is a quick way to switch a model off without deleting any code. Disabled models act like they don't exist, but their code stays in your project. It's just a config line:

-- my_unused_model.sql
{{ config(enabled = false) }}
select ...

This might be the quickest and easiest-to-reverse way to shut off a model, but assuming you’re using git already, you wouldn’t be losing any of the code you delete, either. And assuming you’re concerned with model sprawl, you’re probably better off taking unused models out to the dumpster, rather than declaring a dedicated trash corner.

Lastly, make sure to thank your models for their hard work. As the great Data Engineer Marie Kondo says:

Cherish the [analytical models] that bring you joy, and let go of the rest with gratitude.

Appendix: files for your dbt project

The model itself (e.g. models/dbt_model_descendants.sql):

{{ config(materialized='table') }}

with
nodes as (
    select
        dbt_node_name as node,
        lower(concat(dbt_target_database, '.', dbt_target_schema, '.', dbt_node_alias)) as table_sk,
        dbt_node_refs as parent_array
    from {{ ref('dbt_queries') }}
    where
        start_time > current_date - 3 -- tunes risk of deleted model inclusion
        and dbt_node_resource_type in ('model', 'snapshot', 'seed')
        and execution_status = 'SUCCESS'
        -- [optional] add additional filters if you want to exclude certain environments or projects
        -- and dbt_node_package_name = <my project>
        -- and dbt_target_name = <my target>
        -- and dbt_target_database = <my prod db>
        -- and dbt_target_schema in <my prod schemas>
    qualify row_number() over (partition by dbt_node_name order by end_time desc) = 1
),
-- Unpack the parents (refs) array and swap the relationship into node -> descendant terms.
node_children as (
    select
        parents.value::text as node,
        nodes.node as descendant
    from nodes, lateral flatten(input => parent_array) as parents
),
-- Traverse the children of children etc., building up node-descendant pairs and path arrays
-- (creates one row per *path* between nodes).
node_descendants_recursive (node, descendant, depth, path_array) as (
    select
        node,
        descendant,
        1 as depth,
        array_construct(node, descendant) as path_array
    from node_children

    union all

    select
        node_descendants_recursive.node,
        node_children.descendant,
        node_descendants_recursive.depth + 1 as depth,
        array_cat(node_descendants_recursive.path_array, to_array(node_children.descendant)) as path_array
    from node_descendants_recursive
    inner join node_children
        on node_descendants_recursive.descendant = node_children.node
    where
        node_descendants_recursive.depth < 50 -- recursion hard-stop
),
-- Add every node as its own descendant with depth 0
node_descendants_and_self as (
    select
        node,
        descendant,
        depth,
        path_array
    from node_descendants_recursive

    union all

    select
        node,
        node as descendant,
        0 as depth,
        to_array(node) as path_array
    from nodes
),
-- De-dupe paths between nodes
node_descendants as (
    select
        node,
        descendant,
        count(*) as count_paths,
        min(depth) as min_depth,
        -- Path arrays for fun. Depth ties are common and non-deterministic here.
        min_by(path_array, depth) as shortest_path_array
    from node_descendants_and_self
    group by 1, 2
),
final as (
    select
        node_descendants.node,
        node_descendants.descendant,
        concat(node_descendants.node, '-', node_descendants.descendant) as node_descendant_sk,
        node.table_sk,
        descendant.table_sk as descendant_table_sk
    from node_descendants
    inner join nodes as node
        on node_descendants.node = node.node
    inner join nodes as descendant
        on node_descendants.descendant = descendant.node
)
select *
from final

And the accompanying schema file (e.g. models/dbt_model_descendants.yml):

version: 2

models:
  - name: dbt_model_descendants
    description: >-
      A table mapping each DAG model node to all of its descendant model nodes.
      The mapping includes the node itself as a descendant with depth = 0. Sources are not included.
    columns:
      - name: node_descendant_sk
        description: Unique identifier of a node-descendant pairing
        tests:
          - unique
          - not_null
      - name: node
        description: The name of a node in the DAG
      - name: descendant
        description: The name of a downstream node
      - name: table_sk
        description: Surrogate key uniquely identifying the node's database.schema.table name
      - name: descendant_table_sk
        description: >-
          Surrogate key uniquely identifying the downstream node's
          database.schema.table name
Jay Sobel
Analytics Engineer at Ramp
Jay is a Senior Analytics Engineer at Ramp, one of the fastest growing startups in the US. Jay has nearly a decade of data analysis & engineering experience spanning many fast-growing technology companies like Gopuff, Drizly, Wanderu and LevelUp. Jay is a passionate member of the dbt and Snowflake community, regularly contributing to optimization and general best practice discussions.
