
Analyzing your DAG to identify unused dbt models in Snowflake

  • Jay Sobel
    Analytics Engineer at Ramp

Why care about unused dbt models?

One of the easiest ways to reduce unnecessary Snowflake spend is to get rid of things that aren't being used. In a previous post on identifying unused tables in Snowflake, Ian explained how Snowflake's Account Usage views can be used to introspect Snowflake object usage and ultimately identify and remove tables that aren't being actively queried, allowing users to save on storage costs. For tables that are created and continuously updated by ELT tools like dbt, the potential savings are much higher: on top of the storage costs, you also save the compute costs of building and updating the table.

If your dbt project has been around for over a year, it's quite likely that you'll have a number of dbt models that are no longer being used, but are still running every day and driving compute costs. If you're looking for a quick win to both lower your costs and improve the cleanliness of your data warehouse, this post is for you!

Understanding dbt model usage

In this article I'll expand on the premise of understanding Snowflake object usage to specifically capture dbt model usage. This requires one additional model representing the relationships between dbt models (the DAG, as a table), so that intermediate models with zero direct usage aren't flagged as unused as long as their downstream models have some query activity. I recommend starting with the original post, at least to get familiar with the account_usage schema.

To understand why we can’t use the approach from that blog post to identify unused dbt models, consider the following DAG:

Example dbt model DAG in Snowflake

If we were to query for unused tables, we would initially identify every table as having some usage, but that usage would be from dbt itself running tests or building downstream models. Once we exclude queries run by dbt, we might correctly identify the top row (stg_fulfillments, fct_fulfillments and fulfillments_rollup) as unused models, but our output would also claim the entire stg_ layer is unused. In dbt, direct usage is not the only concern. We also need to consider the usage of downstream dependents.

We can do this by building a model that captures dbt model descendants, and then do some clever aggregation of queries ‘upwards’ over these DAG dependencies.

An overview of the approach

Let's consider an even simpler DAG with just 4 models. In order to properly identify unused dbt models, we first need to understand which models rely on each other.

Simple dbt model DAG in Snowflake

For each model, we need to list all the downstream models. Here's how this simple DAG will look in the new dependencies model we'll create. The green rows are a node paired with itself, the orange rows are direct parent-child relationships, and the purple row shows that a direct parent can also be an indirect one (connected through a longer path as well).

Modelling the dbt DAG in Snowflake

Once we have this model, we can do things like determine whether the Alice model can be safely removed by checking the usage on the downstream dependencies: Bob, Chad, and Delta.
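
To make that concrete, here's a hypothetical sketch. The model names and DAG shape are illustrative only (assuming Alice feeds Bob and Chad, and both feed Delta), and the dependencies model referenced is the dbt_model_descendants table we build later in this post. Checking whether Alice is safe to remove means checking usage across every row it returns:

-- Illustrative only: list everything whose usage counts toward alice
select descendant
from dbt_model_descendants
where node = 'alice';
-- hypothetical result: alice (itself, depth 0), bob, chad, delta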

Prerequisites

To determine which tables are being used, we’ll leverage the models discussed in the previous article. Both of these are available in the dbt-snowflake-monitoring package built & maintained by SELECT.

  • dbt_snowflake_monitoring/models/query_base_object_access.sql
  • dbt_snowflake_monitoring/models/query_history_enriched.sql

For our model of dbt dependents, we'll be building something new: dbt_model_descendants. This can be derived from dbt-snowflake-monitoring or, more accurately, via dbt_artifacts if you have it set up. I'll provide SQL for both sources:

  • Option 1: dbt_snowflake_monitoring/dbt_queries.sql
  • Option 2: dbt_artifacts/dim_dbt__current_models.sql

How to model dependencies in your dbt DAG

Step 1: get each model's parents

Our first step is to derive a table with one row per dbt model and an array column capturing the model's direct parents.

node              | table_sk                         | parent_array
------------------|----------------------------------|-------------------------
customer_activity | prod.analytics.customer_activity | ["customers", "events"]
events            | prod.analytics.events            | ["stg_events"]

To build this dataset, there are two options.

Using dbt_snowflake_monitoring

The first option is to use dbt_snowflake_monitoring/dbt_queries.sql, which you should already have installed for the other required models (query_base_object_access, query_history_enriched). The two main drawbacks of this option are that deleted models will continue to be included for a couple of days after they leave the project, and that sources are never included, because they are not "refs".

select
    dbt_node_name as node,
    lower(concat(dbt_target_database, '.', dbt_target_schema, '.', dbt_node_alias)) as table_sk,
    dbt_node_refs as parent_array
from dbt_queries
where
    start_time > current_date - 3 -- tunes risk of deleted model inclusion
    and dbt_node_resource_type in ('model', 'snapshot', 'seed')
    and execution_status = 'SUCCESS'
    -- [optional] add additional filters if you want to exclude certain environments or projects
    -- and dbt_node_package_name = <my project>
    -- and dbt_target_name = <my target>
    -- and dbt_target_database = <my prod db>
    -- and dbt_target_schema in <my prod schemas>
qualify row_number() over (partition by dbt_node_name order by end_time desc) = 1

Using dbt_artifacts

The second option is to use dbt_artifacts/dim_dbt__current_models. This is the more robust option, but it requires the dbt_artifacts package, which has a more involved set-up process.

select
    split_part(node_id, '.', 3) as node,
    lower(concat(database, '.', schema, '.', name)) as table_sk,
    depends_on_nodes as parent_array
from dim_dbt__current_models
where true
    -- [optional] filter to specific databases
    -- and database in (<your databases>)
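
As a side note on the split_part calls above: dbt node IDs follow the pattern resource_type.package_name.node_name, so taking the third dot-separated part isolates the node name. A quick illustration with a made-up node ID:

select split_part('model.my_project.customer_activity', '.', 3) as node;
-- returns: customer_activity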

Step 2: Derive node children

Now that we have a list of nodes, we'll create a new CTE, node_children, by flattening the nodes CTE. This maps out the first-degree parent-child relationships.

Using dbt_snowflake_monitoring

with
nodes as (
    select
        dbt_node_name as node,
        dbt_node_refs as parent_array,
        lower(concat(dbt_target_database, '.', dbt_target_schema, '.', dbt_node_alias)) as table_sk,
        query_id
    from dbt_queries
    where true
        and start_time > current_date - 3 -- tunes risk of deleted model inclusion
        and dbt_node_resource_type in ('model', 'snapshot', 'seed')
        and execution_status = 'SUCCESS'
    qualify row_number() over (partition by dbt_node_name order by end_time desc) = 1
),
-- Unpack the parents (refs) array and swap the relationship into node -> descendant terms.
node_children as (
    select
        parents.value::text as node,
        nodes.node as descendant
    from nodes, lateral flatten(input => parent_array) as parents
)

Using dbt_artifacts

with
nodes as (
    select
        -- assume packaged model names do not collide
        split_part(node_id, '.', 3) as node,
        lower(concat(database, '.', schema, '.', name)) as table_sk,
        depends_on_nodes as parent_array
    from dim_dbt__current_models
),
-- Unpack the parents array and swap the relationship into node -> descendant terms.
node_children as (
    select
        split_part(parents.value::text, '.', 3) as node,
        nodes.node as descendant
    from nodes,
        lateral flatten(input => parent_array) as parents
    -- Parents can be models, seeds, sources, metrics or snapshots; keep only models, snapshots and seeds here.
    where
        split_part(parents.value::text, '.', 1) in ('model', 'snapshot', 'seed')
),
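
Before moving on, it can help to picture the output. For the hypothetical Alice -> {Bob, Chad} -> Delta DAG from earlier, node_children would hold only the direct edges; the recursion in step 3 fills in deeper relationships like alice -> delta:

-- Hypothetical contents of node_children for Alice -> {Bob, Chad} -> Delta
select * from (
    values
        ('alice', 'bob'),
        ('alice', 'chad'),
        ('bob', 'delta'),
        ('chad', 'delta')
) as node_children (node, descendant);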

Step 3: Recursively find all model descendants

The rest of the query is the same regardless of whether you're using dbt-snowflake-monitoring or dbt_artifacts. It carries out the following steps:

  • Derive node_descendants_recursive (all degrees) by recursively joining node_children (from above) to itself
  • The granularity at this point is "all paths"
  • Union an additional row for "a node and itself"
  • Aggregate node_descendants to unique node-descendant pairs

Here's the query assuming you are using dbt-snowflake-monitoring:

with
nodes as (
    select
        dbt_node_name as node,
        lower(concat(dbt_target_database, '.', dbt_target_schema, '.', dbt_node_alias)) as table_sk,
        dbt_node_refs as parent_array
    from dbt_queries
    where
        start_time > current_date - 3 -- tunes risk of deleted model inclusion
        and dbt_node_resource_type in ('model', 'snapshot', 'seed')
        and execution_status = 'SUCCESS'
        -- [optional] add additional filters if you want to exclude certain environments or projects
        -- and dbt_node_package_name = <my project>
        -- and dbt_target_name = <my target>
        -- and dbt_target_database = <my prod db>
        -- and dbt_target_schema in <my prod schemas>
    qualify row_number() over (partition by dbt_node_name order by end_time desc) = 1
),
-- Unpack the parents (refs) array and swap the relationship into node -> descendant terms.
node_children as (
    select
        parents.value::text as node,
        nodes.node as descendant
    from nodes, lateral flatten(input => parent_array) as parents
),
-- Traverse the children of children etc., building up node-descendant pairs and path arrays
-- (creates one row per *path* between nodes).
node_descendants_recursive (node, descendant, depth, path_array) as (
    select
        node,
        descendant,
        1 as depth,
        array_construct(node, descendant) as path_array
    from node_children

    union all

    select
        node_descendants_recursive.node,
        node_children.descendant,
        node_descendants_recursive.depth + 1 as depth,
        array_cat(node_descendants_recursive.path_array, to_array(node_children.descendant)) as path_array
    from node_descendants_recursive
    inner join node_children
        on node_descendants_recursive.descendant = node_children.node
    where
        node_descendants_recursive.depth < 50 -- recursion hard-stop
),
-- Add every node as its own descendant with depth 0
node_descendants_and_self as (
    select
        node,
        descendant,
        depth,
        path_array
    from node_descendants_recursive

    union all

    select
        node,
        node as descendant,
        0 as depth,
        to_array(node) as path_array
    from nodes
),
-- De-dupe paths between nodes
node_descendants as (
    select
        node,
        descendant,
        count(*) as count_paths,
        min(depth) as min_depth,
        -- Path arrays for fun. Depth ties are common and non-deterministic here.
        min_by(path_array, depth) as shortest_path_array
    from node_descendants_and_self
    group by 1, 2
),
final as (
    select
        node_descendants.node,
        node_descendants.descendant,
        concat(node_descendants.node, '-', node_descendants.descendant) as node_descendant_sk,
        node.table_sk,
        descendant.table_sk as descendant_table_sk
    from node_descendants
    inner join nodes as node
        on node_descendants.node = node.node
    inner join nodes as descendant
        on node_descendants.descendant = descendant.node
)
select *
from final

Head to the appendix for a version of this query you can leverage in your dbt project.

How to query for unused dbt models

With our new dbt_model_descendants model accounting for model dependencies (or should we say descendencies?), we can aggregate direct table usage and distribute it upward through the DAG. This looks like a join of query counts on the descendant side, conditionally aggregated around the parent. The self-edge comes into play here: the conditional aggregation can differentiate direct from indirect usage by checking whether the descendant is the node itself.

with
table_queries as (
    select
        lower(query_base_object_access.object_name) as table_sk,
        count(*) as count_queries
    from query_history_enriched
    inner join query_base_object_access
        on query_history_enriched.query_id = query_base_object_access.query_id
        and query_history_enriched.start_time = query_base_object_access.query_start_time
    where
        query_history_enriched.start_time > current_date - 180
        and query_history_enriched.query_type = 'SELECT'
        and query_history_enriched.execution_status = 'SUCCESS'
        -- exclude dbt queries
        and dbt_metadata is null
        -- optional: remove usage from tooling that automatically runs queries
        -- and user_name not in (<TOOLING USERS>)
    group by 1
)
select
    md.table_sk,
    -- coalesce so models with no matching queries report 0 rather than null
    sum(iff(md.node = md.descendant, coalesce(tq.count_queries, 0), 0)) as direct_queries,
    sum(iff(md.node <> md.descendant, coalesce(tq.count_queries, 0), 0)) as downstream_queries,
    direct_queries + downstream_queries as total_queries
from dbt_model_descendants as md
left join table_queries as tq
    on md.descendant_table_sk = tq.table_sk
group by 1
order by total_queries asc

This query tells us how many "usage" queries directly hit each dbt model, as well as how many usage queries land on a model's downstream descendants. If a model has total_queries = 0 then it's both not serving direct usage and not supporting any downstream direct usage. Do note that downstream_queries and total_queries will sum to more than your actual Snowflake query count, since a single query can be counted against more than one model (once for the model it selects from, and once for each of that model's ancestors).

What to do with unused models

As an analytics engineer, I know more about building new tables than I do about deleting old ones.

Most dbt models are fixed transformations of raw data that can be switched off and on without "missing" anything. Sure, the production table will go stale until the model is turned back on, but there will be no irretrievable loss of information. In these cases, simply disabling the model, or deleting it and letting it live on in your git history, are both good options. I'd recommend also dropping the table at this stage, just to avoid any users accessing stale data.

Models like dbt snapshots, or other fancy incremental schemes, might not fit this description. Deprecating something in this category will require more case-specific consideration, but having been in this position, I know it's also likely that nobody remembers what the model is for, or even what its intentions were.

How to remove a dbt model from your project

Steps for deleting a model:

  1. Delete the .sql model file.
  2. Ctrl+Shift+F search the model’s name in the whole project to find…
    • ref() calls to the model
    • schema or config .yml references to the model.
  3. Drop the corresponding Snowflake table (or view); see the sketch below.

You probably won’t have to update a ref() if you’re using this approach because any model referencing an unused model must also be unused (or else the parent would have downstream usage!). If there is a chain of unused models, I’d recommend starting at the end and working backwards; in A -> B -> C, delete C first!
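
For step 3, the drop itself is a one-liner. A minimal sketch, using a hypothetical database, schema and model name (swap in drop view if the model is materialized as a view):

drop table if exists prod.analytics.my_unused_model;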

Disabling a model is a quick way to switch a model off without deleting any code. Disabled models act like they don't exist, but their code stays in your project. It's just a config line:

-- my_unused_model.sql
{{ config(enabled = false) }}
select ...

This might be the quickest and easiest-to-reverse way to shut off a model, but assuming you’re using git already, you wouldn’t be losing any of the code you delete, either. And assuming you’re concerned with model sprawl, you’re probably better off taking unused models out to the dumpster, rather than declaring a dedicated trash corner.

Lastly, make sure to thank your models for their hard work. As the great Data Engineer Marie Kondo says:

Cherish the [analytical models] that bring you joy, and let go of the rest with gratitude.

Appendix: files for your dbt project

The model itself (e.g. models/dbt_model_descendants.sql):

{{ config(materialized='table') }}

with
nodes as (
    select
        dbt_node_name as node,
        lower(concat(dbt_target_database, '.', dbt_target_schema, '.', dbt_node_alias)) as table_sk,
        dbt_node_refs as parent_array
    from {{ ref('dbt_queries') }}
    where
        start_time > current_date - 3 -- tunes risk of deleted model inclusion
        and dbt_node_resource_type in ('model', 'snapshot', 'seed')
        and execution_status = 'SUCCESS'
        -- [optional] add additional filters if you want to exclude certain environments or projects
        -- and dbt_node_package_name = <my project>
        -- and dbt_target_name = <my target>
        -- and dbt_target_database = <my prod db>
        -- and dbt_target_schema in <my prod schemas>
    qualify row_number() over (partition by dbt_node_name order by end_time desc) = 1
),
-- Unpack the parents (refs) array and swap the relationship into node -> descendant terms.
node_children as (
    select
        parents.value::text as node,
        nodes.node as descendant
    from nodes, lateral flatten(input => parent_array) as parents
),
-- Traverse the children of children etc., building up node-descendant pairs and path arrays
-- (creates one row per *path* between nodes).
node_descendants_recursive (node, descendant, depth, path_array) as (
    select
        node,
        descendant,
        1 as depth,
        array_construct(node, descendant) as path_array
    from node_children

    union all

    select
        node_descendants_recursive.node,
        node_children.descendant,
        node_descendants_recursive.depth + 1 as depth,
        array_cat(node_descendants_recursive.path_array, to_array(node_children.descendant)) as path_array
    from node_descendants_recursive
    inner join node_children
        on node_descendants_recursive.descendant = node_children.node
    where
        node_descendants_recursive.depth < 50 -- recursion hard-stop
),
-- Add every node as its own descendant with depth 0
node_descendants_and_self as (
    select
        node,
        descendant,
        depth,
        path_array
    from node_descendants_recursive

    union all

    select
        node,
        node as descendant,
        0 as depth,
        to_array(node) as path_array
    from nodes
),
-- De-dupe paths between nodes
node_descendants as (
    select
        node,
        descendant,
        count(*) as count_paths,
        min(depth) as min_depth,
        -- Path arrays for fun. Depth ties are common and non-deterministic here.
        min_by(path_array, depth) as shortest_path_array
    from node_descendants_and_self
    group by 1, 2
),
final as (
    select
        node_descendants.node,
        node_descendants.descendant,
        concat(node_descendants.node, '-', node_descendants.descendant) as node_descendant_sk,
        node.table_sk,
        descendant.table_sk as descendant_table_sk
    from node_descendants
    inner join nodes as node
        on node_descendants.node = node.node
    inner join nodes as descendant
        on node_descendants.descendant = descendant.node
)
select *
from final

And the accompanying schema file (e.g. models/dbt_model_descendants.yml):

version: 2

models:
  - name: dbt_model_descendants
    description: >-
      A table mapping each DAG model node to all of its descendant model nodes.
      The mapping includes the node itself as a descendant with depth = 0. Sources are not included.
    columns:
      - name: node_descendant_sk
        description: Unique identifier of a node-descendant pairing
        tests:
          - unique
          - not_null
      - name: node
        description: The name of a node in the DAG
      - name: descendant
        description: The name of a downstream node
      - name: table_sk
        description: Surrogate key uniquely identifying the node's database.schema.table name
      - name: descendant_table_sk
        description: >-
          Surrogate key uniquely identifying the downstream node's
          database.schema.table name
Jay Sobel
Analytics Engineer at Ramp
Jay is a Senior Analytics Engineer at Ramp, one of the fastest growing startups in the US. Jay has nearly a decade of data analysis & engineering experience spanning many fast-growing technology companies like Gopuff, Drizly, Wanderu and LevelUp. Jay is a passionate member of the dbt and Snowflake community, regularly contributing to optimization and general best practice discussions.
