Monitoring dbt model spend and performance with metadata
- Date
- Ian WhitestoneCo-founder & CEO of SELECT
In a previous post, we covered how Snowflake query tags & comments allow users to associate arbitrary metadata with each query. In this post, we show how you can add query tags or comments to your dbt models in order to track their spend or performance over time.
Why track dbt model spend & performance?
dbt has skyrocketed in popularity over the last 5 years, becoming the most popular framework for building and managing data models within the data warehouse. Performing lots of data transformations on large datasets within the warehouse is not cheap, however. Whether using dbt, or any other SQL-based transformation tool, the costs associated with these transformations tend to make up a significant portion of Snowflake customer's compute spend.
As customers look to better understand, monitor, and reduce their data cloud spend, it has become increasingly important to get more insight into the spend associated with each dbt model. Additionally, as customers increasingly use dbt to power business critical applications and decision making, it becomes necessary for customers to monitor model performance in order to ensure that SLAs are met.
When using Snowflake and dbt, customers do not get these crucial monitoring features out of the box. By adding metadata to their dbt models through query tags or comments, customers can achieve these core monitoring abilities.
Setting query tags in dbt
In our post on query tags, we outlined the three options for setting query tags in dbt:
- Setting it globally in your
profiles.yml
- Adding a
query_tag
for each model in yourdbt_project.yml
or in the model config - Creating a
set_query_tag
macro to dynamically set the query tag for each model in your project.
Approach #3 is by far the best option as it avoids having users manually set the tags. If you'd like to get started with dynamically setting query tags for each model, you can implement a custom macro like the one here to add detailed metadata to each query issued by dbt.
Setting query comments in dbt (recommended approach)
For dbt related metadata, we recommend using query comments instead of query tags since the computer generated metadata can occasionally exceed the 2000 character limit of query tags.
dbt provides this setting out of the box. In your dbt_project.yml
, you can add the following:
query-comment:
append: true # Snowflake removes prefixed comments
This will add a query comment to the bottom of your query:
create view analytics.analytics.orders as (
select ...
);
/* {"app": "dbt", "dbt_version": "0.15.0rc2", "profile_name": "debug",
"target_name": "dev", "node_id": "model.dbt2.my_model"} */
In order to add more comprehensive metadata in the query comment, you can install our dbt-snowflake-monitoring query package. This package makes the following metadata available for all dbt queries:
{
"dbt_snowflake_query_tags_version": "2.0.0",
"app": "dbt",
"dbt_version": "1.4.0",
"project_name": "my_project",
"target_name": "dev",
"target_database": "dev",
"target_schema": "larry_goldings",
"invocation_id": "c784c7d0-5c3f-4765-805c-0a377fefcaa0",
"node_name": "orders",
"node_alias": "orders",
"node_package_name": "my_project",
"node_original_file_path": "models/staging/orders.sql",
"node_database": "dev",
"node_schema": "mart",
"node_id": "model.my_project.orders",
"node_resource_type": "model",
"materialized": "incremental",
"is_incremental": true,
"node_refs": [
"raw_orders",
"product_mapping"
],
"dbt_cloud_project_id": "146126",
"dbt_cloud_job_id": "184124",
"dbt_cloud_run_id": "107122910",
"dbt_cloud_run_reason_category": "other",
"dbt_cloud_run_reason": "Kicked off from UI by [email protected]",
}
Using this info, you can can monitor cost and performance by a variety of interesting dimensions, such as dbt project, model name, environment (dev or prod), materialization type, and more.
Monitoring dbt model performance
When using query tags, you can monitor your dbt model performance using a variation of the query below:
select
date_trunc('day', start_time) as date,
try_parse_json(query_tag)['model_name']::string as model_name,
count(distinct try_parse_json(query_tag)['invocation_id']) as num_executions,
avg(total_elapsed_time/1000) as avg_total_elapsed_time_s,
approx_percentile(total_elapsed_time/1000, 0.95) as p95_total_elapsed_time_s,
avg(execution_time/1000) as avg_execution_time_s,
approx_percentile(execution_time/1000, 0.95) as p95_execution_time_s
-- optionally repeat for other query time metrics, like:
-- compilation_time, queued_provisioning_time, queued_overload_time, etc.
from snowflake.account_usage.query_history
where
try_parse_json(query_tag)['app']::string = 'dbt'
and start_time > current_date - 30
group by 1,2
If you're using query comments, you'll first have to parse out the metadata from the comment text:
with
query_history as (
select
*,
regexp_substr(query_text, '/\\*\\s({"app":\\s"dbt".*})\\s\\*/', 1, 1, 'ie') as _dbt_json_meta,
try_parse_json(_dbt_json_meta) as dbt_metadata
from snowflake.account_usage.query_history
)
select
date_trunc('day', start_time) as date,
dbt_metadata['model_name']::string as model_name,
count(distinct dbt_metadata['invocation_id']) as num_executions,
avg(total_elapsed_time/1000) as avg_total_elapsed_time_s,
approx_percentile(total_elapsed_time/1000, 0.95) as p95_total_elapsed_time_s,
avg(execution_time/1000) as avg_execution_time_s,
approx_percentile(execution_time/1000, 0.95) as p95_execution_time_s
-- optionally repeat for other query time metrics, like:
-- compilation_time, queued_provisioning_time, queued_overload_time, etc.
...
from query_history
where
dbt_metadata is not null
and start_time > current_date - 30
group by 1,2
dbt-snowflake-monitoring to the rescue!
If you're using our dbt-snowflake-monitoring package, this query comment parsing is automatically done for you.
Both queries above only look at the query run times. A number of other metrics you can monitor in conjunction are:
partitions_scanned
&partitions_total
: informs you whether the queries are efficiently pruning out micro-partitionsbytes_scanned
: gives you an idea of how much data is being processed over time, which may explain increased run timesbytes_spilled_to_local_storage
&bytes_spilled_to_remote_storage
: indicates whether your model may benefit from running on a larger warehousequeued_overload_time
: indiciates whether you may need to increase youmax_cluster_count
on the warehouse
Monitoring dbt model spend
In order to monitor dbt model costs over time, first calculate the cost of each Snowflake query. Next, the query costs can be used to aggregate spend by model:
select
date_trunc('day', start_time) as date,
try_parse_json(query_tag)['model_name']::string as model_name,
count(distinct try_parse_json(query_tag)['invocation_id']) as num_executions,
sum(query_cost) as total_cost,
total_cost / num_executions as avg_cost_per_execution
from query_history_w_costs
where
try_parse_json(query_tag)['app']::string = 'dbt'
and start_time > current_date - 30
group by 1,2
If you'd like to automatically get a query history with the costs for each query added, you can install our dbt-snowflake-monitoring package.
Use SELECT for dbt model cost & performance monitoring
At SELECT, we leverage query tags & comments to allow our customers to monitor their dbt-issued queries by a variety of dimensions: environment, materialization, resource type, etc. Gaining this visibility is incredibly valuable, as it allows data teams to easily prioritize which dbt models require some extra attention. An example of how we surface this information in the product is shown below:
If you're looking to cut your Snowflake costs, want to get a better picture of what's driving them, or just want to keep a pulse on things, you can get access today or book a demo using the links below.