Monitoring dbt model spend and performance with metadata
- Ian Whitestone, Co-founder & CEO of SELECT
In a previous post, we covered how Snowflake query tags & comments allow users to associate arbitrary metadata with each query. In this post, we show how you can add query tags or comments to your dbt models in order to track their spend or performance over time.
Why track dbt model spend & performance?
dbt has skyrocketed in popularity over the last 5 years, becoming the most popular framework for building and managing data models within the data warehouse. Performing lots of data transformations on large datasets within the warehouse is not cheap, however. Whether using dbt, or any other SQL-based transformation tool, the costs associated with these transformations tend to make up a significant portion of Snowflake customers' compute spend.
As customers look to better understand, monitor, and reduce their data cloud spend, it has become increasingly important to get more insight into the spend associated with each dbt model. Additionally, as customers increasingly use dbt to power business-critical applications and decision making, it becomes necessary to monitor model performance in order to ensure that SLAs are met.
When using Snowflake and dbt, customers do not get these crucial monitoring features out of the box. By adding metadata to their dbt models through query tags or comments, customers can achieve these core monitoring abilities.
Setting query tags in dbt
In our post on query tags, we outlined the three options for setting query tags in dbt:
- Setting it globally in your `profiles.yml`
- Adding a `query_tag` for each model in your `dbt_project.yml` or in the model config
- Creating a `set_query_tag` macro to dynamically set the query tag for each model in your project
Approach #3 is by far the best option as it avoids having users manually set the tags. If you'd like to get started with dynamically setting query tags for each model, you can implement a custom macro like the one here to add detailed metadata to each query issued by dbt.
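As a rough illustration, a minimal version of such a macro might look like the following. This is a sketch, not the full macro linked above: the exact fields included in the tag are assumptions you should adapt to your project.

```sql
-- macros/set_query_tag.sql
-- Minimal sketch: overrides dbt-snowflake's set_query_tag() so that every
-- query issued for a model carries JSON metadata. Fields are illustrative.
{% macro set_query_tag() -%}
    {% set new_query_tag = tojson({
        "app": "dbt",
        "model_name": model.name,
        "invocation_id": invocation_id
    }) %}
    {# remember the session's existing tag so dbt can restore it afterwards #}
    {% set original_query_tag = get_current_query_tag() %}
    {% do run_query("alter session set query_tag = '{}'".format(new_query_tag)) %}
    {{ return(original_query_tag) }}
{%- endmacro %}
```

Because the macro runs once per model execution, the tag automatically reflects whichever model is currently being built, with no per-model configuration required.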
Setting query comments in dbt (recommended approach)
For dbt-related metadata, we recommend using query comments instead of query tags, since the machine-generated metadata can occasionally exceed the 2000 character limit of query tags.
dbt provides this setting out of the box. In your dbt_project.yml
, you can add the following:
query-comment:
append: true # Snowflake removes prefixed comments
This will add a query comment to the bottom of your query:
create view analytics.analytics.orders as (
select ...
);
/* {"app": "dbt", "dbt_version": "0.15.0rc2", "profile_name": "debug",
"target_name": "dev", "node_id": "model.dbt2.my_model"} */
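If you'd rather build the metadata yourself instead of installing a package, dbt also lets the comment be generated by a macro, by setting the `comment` key of `query-comment` in `dbt_project.yml` to a macro call such as `"{{ query_comment(node) }}"`. A minimal, hypothetical macro (the field list is illustrative):

```sql
-- macros/query_comment.sql
-- Hypothetical sketch: returns a JSON string that dbt appends as the
-- query comment for every query it issues.
{% macro query_comment(node) %}
    {%- set comment_dict = {
        "app": "dbt",
        "dbt_version": dbt_version,
        "target_name": target.name,
        "node_id": node.unique_id if node else none
    } -%}
    {{ return(tojson(comment_dict)) }}
{% endmacro %}
```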
In order to add more comprehensive metadata in the query comment, you can install our dbt-snowflake-monitoring query package. This package makes the following metadata available for all dbt queries:
{
"dbt_snowflake_query_tags_version": "2.0.0",
"app": "dbt",
"dbt_version": "1.4.0",
"project_name": "my_project",
"target_name": "dev",
"target_database": "dev",
"target_schema": "larry_goldings",
"invocation_id": "c784c7d0-5c3f-4765-805c-0a377fefcaa0",
"node_name": "orders",
"node_alias": "orders",
"node_package_name": "my_project",
"node_original_file_path": "models/staging/orders.sql",
"node_database": "dev",
"node_schema": "mart",
"node_id": "model.my_project.orders",
"node_resource_type": "model",
"materialized": "incremental",
"is_incremental": true,
"node_refs": [
"raw_orders",
"product_mapping"
],
"dbt_cloud_project_id": "146126",
"dbt_cloud_job_id": "184124",
"dbt_cloud_run_id": "107122910",
"dbt_cloud_run_reason_category": "other"
}
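Installing the package follows the standard dbt workflow: add it to your `packages.yml` and run `dbt deps`. The package coordinates and version range below are illustrative — check the package's README for the current values:

```yaml
# packages.yml -- version range is illustrative; see the package README
packages:
  - package: get-select/dbt_snowflake_monitoring
    version: [">=4.0.0", "<5.0.0"]
```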
Using this info, you can monitor cost and performance by a variety of interesting dimensions, such as dbt project, model name, environment (dev or prod), materialization type, and more.
Monitoring dbt model performance
When using query tags, you can monitor your dbt model performance using a variation of the query below:
select
date_trunc('day', start_time) as date,
try_parse_json(query_tag)['model_name']::string as model_name,
count(distinct try_parse_json(query_tag)['invocation_id']) as num_executions,
avg(total_elapsed_time/1000) as avg_total_elapsed_time_s,
approx_percentile(total_elapsed_time/1000, 0.95) as p95_total_elapsed_time_s,
avg(execution_time/1000) as avg_execution_time_s,
approx_percentile(execution_time/1000, 0.95) as p95_execution_time_s
-- optionally repeat for other query time metrics, like:
-- compilation_time, queued_provisioning_time, queued_overload_time, etc.
from snowflake.account_usage.query_history
where
try_parse_json(query_tag)['app']::string = 'dbt'
and start_time > current_date - 30
group by 1,2
If you're using query comments, you'll first have to parse out the metadata from the comment text:
with
query_history as (
select
*,
regexp_substr(query_text, '/\\*\\s({"app":\\s"dbt".*})\\s\\*/', 1, 1, 'ie') as _dbt_json_meta,
try_parse_json(_dbt_json_meta) as dbt_metadata
from snowflake.account_usage.query_history
)
select
date_trunc('day', start_time) as date,
dbt_metadata['model_name']::string as model_name,
count(distinct dbt_metadata['invocation_id']) as num_executions,
avg(total_elapsed_time/1000) as avg_total_elapsed_time_s,
approx_percentile(total_elapsed_time/1000, 0.95) as p95_total_elapsed_time_s,
avg(execution_time/1000) as avg_execution_time_s,
approx_percentile(execution_time/1000, 0.95) as p95_execution_time_s
-- optionally repeat for other query time metrics, like:
-- compilation_time, queued_provisioning_time, queued_overload_time, etc.
...
from query_history
where
dbt_metadata is not null
and start_time > current_date - 30
group by 1,2
dbt-snowflake-monitoring to the rescue!
If you're using our dbt-snowflake-monitoring package, this query comment parsing is automatically done for you.
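For example, instead of regex-parsing the comment yourself, you can query a model where the metadata has already been split out into columns. The model and column names below are assumptions — consult the package documentation for the exact schema:

```sql
-- Hypothetical sketch against the package's enriched query history model;
-- model and column names may differ by package version.
select
    date_trunc('day', start_time) as date,
    dbt_node_name as model_name,
    count(distinct dbt_invocation_id) as num_executions,
    avg(execution_time / 1000) as avg_execution_time_s
from dbt_queries
where start_time > current_date - 30
group by 1, 2
```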
Both queries above only look at the query run times. A number of other metrics you can monitor in conjunction are:
- `partitions_scanned` & `partitions_total`: informs you whether the queries are efficiently pruning out micro-partitions
- `bytes_scanned`: gives you an idea of how much data is being processed over time, which may explain increased run times
- `bytes_spilled_to_local_storage` & `bytes_spilled_to_remote_storage`: indicates whether your model may benefit from running on a larger warehouse
- `queued_overload_time`: indicates whether you may need to increase your `max_cluster_count` on the warehouse
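These metrics all live on the same `query_history` view, so you can fold them into the monitoring query above. For example, using query tags:

```sql
select
    date_trunc('day', start_time) as date,
    try_parse_json(query_tag)['model_name']::string as model_name,
    -- ratios near 0 mean pruning is working; near 1 suggests full scans
    avg(partitions_scanned / nullif(partitions_total, 0)) as avg_partition_scan_ratio,
    avg(bytes_scanned) as avg_bytes_scanned,
    -- any remote spillage is a strong signal the warehouse is undersized
    sum(bytes_spilled_to_local_storage) as total_bytes_spilled_local,
    sum(bytes_spilled_to_remote_storage) as total_bytes_spilled_remote,
    -- sustained queueing suggests raising max_cluster_count
    avg(queued_overload_time / 1000) as avg_queued_overload_time_s
from snowflake.account_usage.query_history
where
    try_parse_json(query_tag)['app']::string = 'dbt'
    and start_time > current_date - 30
group by 1, 2
```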
Monitoring dbt model spend
To monitor dbt model costs over time, first calculate the cost of each Snowflake query. You can then aggregate those query costs by model:
select
date_trunc('day', start_time) as date,
try_parse_json(query_tag)['model_name']::string as model_name,
count(distinct try_parse_json(query_tag)['invocation_id']) as num_executions,
sum(query_cost) as total_cost,
total_cost / num_executions as avg_cost_per_execution
from query_history_w_costs
where
try_parse_json(query_tag)['app']::string = 'dbt'
and start_time > current_date - 30
group by 1,2
If you'd like to automatically get a query history with the costs for each query added, you can install our dbt-snowflake-monitoring package.
Use SELECT for dbt model cost & performance monitoring
At SELECT, we leverage query tags & comments to allow our customers to monitor their dbt-issued queries by a variety of dimensions: environment, materialization, resource type, etc. Gaining this visibility is incredibly valuable, as it allows data teams to easily prioritize which dbt models require some extra attention. An example of how we surface this information in the product is shown below:
If you're looking to cut your Snowflake costs, want to get a better picture of what's driving them, or just want to keep a pulse on things, you can get access today or book a demo using the links below.