Best Practices for dbt Workflows, Part 1: Concepts & Slim Local Builds
- Alex Caruso, Lead Data Platform Engineer at Entera
dbt has become a cornerstone for many data teams, enabling modular and consistent data transformation workflows to power BI and ML products. As dbt projects scale to hundreds or even thousands of models, efficient builds become critical to keeping costs down and performance up. There are numerous ways to optimize dbt projects and keep them lean, ranging from model-level optimizations to architecture choices that encourage efficient builds. This is the first post in a three-part series doing a deep dive into the latter, specifically strategies for achieving “slim” dbt builds. This post introduces some high-level concepts and then reviews hands-on examples for efficient local dbt development.
What are dbt Slim Builds?
“Slim” dbt builds are about minimizing redundant, unnecessary, or erroneous model invocations as aggressively as possible, within the constraints of your execution environment. The phrase “slim CI” was popularized by dbt Labs back in 2021 with the introduction of the state:modified node selector. This selector is one powerful technique for eliminating unnecessary node invocations, but today there are many more options to explore for keeping your dbt project lean and performant.
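For example, a single state-aware invocation might look like the sketch below. The .state directory holding a previously saved production manifest is an assumption we'll set up later in this post.
# a minimal sketch: rebuild only models whose code changed since the last
# production run, plus everything downstream of them (state:modified+),
# comparing against a saved production manifest in .state/
dbt build --select state:modified+ --state .state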
The image below demonstrates the core concept behind slim builds: based on selection criteria of your choosing, only a subset of models gets rebuilt in a given dbt invocation. That subset might be models whose code was modified, models with updated sources, or models that meet any number of other criteria.
Why Focus on Orchestration Versus Other Optimizations?
One thing I've often observed in discussions of the performance and cost footprint of dbt projects is a focus on “low level” optimizations. Examples include query profiles for individual models, clustering keys, warehouse right-sizing, and model incrementalization. All of these are important for maintaining a performant and cost-efficient dbt project.
However, one aspect that I believe is commonly overlooked is the orchestration around running dbt models: specifically, the conditions under which, and the frequency with which, models are built and re-built. Ideally, models are re-built as infrequently as possible, and redundant or unnecessary model invocations are either eliminated entirely or altered to minimize runtime and compute cost.
If your dbt compute costs are out of control, it might not be because your models are non-performant. You may just be re-building them too much!
Before diving into some strategies for achieving slim builds, let’s review some assumptions and possible architectures which may have implications for the concepts in this series.
Assumptions
For the sake of consistency throughout this series, let’s assume the following:
- Use of dbt Core, rather than dbt Cloud
- Warehouses are right-sized; that is, no undersized warehouses in use for large models which draw out runtimes
- Standard batch processing pipelines with daily, weekly, or monthly refresh rates. We’ll ignore less common patterns involving microbatches, streams, or dynamic tables
- Database-level isolation across disparate dbt invocation contexts
dbt Invocation Contexts
Most organizations run dbt in three different “invocation contexts”:
- Local development (e.g. running dbt from your laptop)
- CI/CD pipelines (e.g. GitHub Actions jobs to test code changes before pushing to production)
- Scheduled builds, typically orchestrated by tools like Airflow, Dagster, or Prefect
Each of these invocation contexts usually has one or more dedicated databases or schemas in which models can be built in isolation. The strategies for slim builds described in this series can provide benefits in each of these contexts. For simplicity, let’s assume the following architecture:
- Local dbt invocations are built in a dedicated DBT_DEV database, with one schema per developer
- CI/CD dbt invocations are built in a dedicated <GIT BRANCH NAME> database, named after the git branch for your changeset / PR. For example, ACARUSO__CREATE_DIM_ORDERS_MODEL
- Scheduled dbt invocations are built in a dedicated database, named after the schedule. For example, DBT_DAILY
Upstream dependencies for models built in each of these contexts can be referenced either as clones or via deferrals. For CI/CD and scheduled production builds, assume models get promoted to a production database after a successful dbt invocation, in a blue-green style.
Example Setup: Persisting dbt Artifacts
Before jumping into specific examples of slim selection criteria and build techniques, let’s run through a simple approach for persisting dbt artifacts, which will be needed for some slim selection techniques. In particular, we’ll need to persist the manifest.json and sources.json files. These files get auto-generated to target/<artifact>.json upon running various dbt commands such as dbt run, dbt build, or dbt source freshness.
In a CI/CD invocation context, you typically won’t be running jobs on a machine with persistent storage across builds, so I like to store artifacts in remote object storage such as AWS S3.
After a successful production invocation of dbt run or dbt build:
cd target
aws s3 cp manifest.json s3://dbt-artifacts/manifest.json
After a successful production invocation of dbt source freshness:
cd target
aws s3 cp sources.json s3://dbt-artifacts/sources.json
Now, you can download these files into a directory named .state before any dbt invocation where you’re utilizing state-based selectors.
aws s3 sync s3://dbt-artifacts .state
With these files in place, dbt has the information it needs to determine which nodes to run for an invocation with a given set of state selectors.
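As a rough illustration of what those artifacts enable, a state-aware invocation might look like the sketch below; the exact selectors you use will depend on the techniques covered in the rest of this series.
# manifest.json powers state:modified+ (code changes plus their downstream nodes);
# sources.json powers source_status:fresher+ (sources with newly arrived data)
dbt build --select state:modified+ source_status:fresher+ --state .state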
Slim Local Builds
Let’s now dive into the different options you have for running efficient dbt builds locally.
--defer CLI Flag for Re-Using Pre-Built Production Objects
Rather than rebuilding upstream dependencies from scratch every time you invoke dbt locally, you can “defer” references to another database using the --defer CLI flag. This saves you from waiting on long-running, expensive builds in a local development context. If the upstreams for the models you’re working on are already built in some production database, and you don’t need to change them, you might as well re-use them in a development context.
Suppose our DAG looks like this, with model_a and model_b feeding into model_c:
Now imagine you’re making changes to model_c and running dbt locally to test them out. You need the latest data in model_a and model_b to test your changes, but they are huge, expensive models that would take hours to build on a development warehouse. This is a perfect case for --defer. Instead of running:
dbt run -s +model_c
try:
dbt run -s model_c --defer --state .state
This will “defer” the references to those models to whichever database is defined in your .state/manifest.json file, which should be your production database, as we set up above. Now your compiled SQL for model_c might look like:
create or replace transient table dbt_dev.acaruso.model_c as
with
a as (select * from dbt_prod.schema.model_a),
b as (select * from dbt_prod.schema.model_b),
final as (select * from a inner join b using (id))
select * from final;
Notice how the fully qualified references to model_a and model_b use the dbt_prod database, not the dbt_dev database, because we never rebuilt those in dev. Note: you should always work from an empty schema with --defer, to ensure references actually get deferred. If the refs already exist in your target schema, dbt will use those instead of deferring (unless you pass the --favor-state flag, which prefers the deferred relations).
The same benefits can be achieved using object clones, such as those created by Snowflake’s zero-copy cloning feature. However, clones need to be recreated every time objects in the source database are updated, and not all data warehouse systems support zero-copy cloning, so it is a little more straightforward and general to use --defer for local development use cases.
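For comparison, a clone-based workflow on Snowflake might look roughly like the sketch below; the snowsql connection and the dbt_prod / dbt_dev_clone database names are assumptions for illustration.
# hypothetical: refresh a zero-copy clone of the production database for dev use;
# this must be re-run whenever the production objects you depend on are updated
snowsql -q "create or replace database dbt_dev_clone clone dbt_prod;"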
Bonus: Wrapper Script for --defer
Instead of having to fetch your remote manifest artifact and type out --defer --state /path/to/your/state every time, I wrote a little wrapper script to simplify things. I host my dbt docs site from the same S3 bucket where I persist artifacts after builds, so this is an easy way to fetch the latest prod artifacts before running dbt invocations with --defer.
dbtdefer.sh
#!/bin/bash -ue
prod_manifest_uri="http://dbt.yourdomain/manifest.json"
target_dir="path/to/your/dbt/project/target/prod_manifest"
mkdir -p "$target_dir"
target_file="${target_dir}/manifest.json"
# this writes the contents of "$prod_manifest_uri" to "$target_file", if
# "$prod_manifest_uri" has been modified more recently than "$target_file"
curl -o "$target_file" -z "$target_file" "$prod_manifest_uri"
set -x
dbt "$@" --defer --state "$target_dir"
chmod +x dbtdefer.sh
echo 'alias dbtdefer="/path/to/your/dbt/project/bin/dbt_defer/dbtdefer.sh"' >> ~/.bashrc
source ~/.bashrc
Now you can fetch the latest manifest and defer references with a single command: dbtdefer run -s <your selectors>. It works with dbtdefer build too, or any other dbt command where you might want to defer references.
--empty CLI Flag for Schema-Only Dry Runs
Another tool for slim local builds is the --empty CLI flag, available in dbt Core 1.8 and later. Adding this option to dbt run results in a limit 0 statement getting injected into all sources and refs in your model.
Instead of:
with my_expensive_transform as (
    select ...
    from {{ ref('my_big_table') }}
)
...
dbt will submit something like this to the warehouse instead:
with my_expensive_transform as (
    select ...
    from (
        select * from {{ ref('my_big_table') }} limit 0
    )
)
...
This is a subtle difference, but it can make a big impact for complex queries with expensive transformations or large base tables. dbt still executes the model SQL against the target data warehouse, but avoids expensive reads and transformations of input data. This validates dependencies and confirms your models will build properly, without actually populating them. It is useful for validating schema changes in local development, where you might then build the models with full input data in another environment with larger compute resources.
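Putting it together, a schema-only dry run for the earlier example might look like the sketch below, assuming a dbt Core version that supports --empty and the .state setup from above.
# create model_c with zero input rows: refs/sources are limited to 0 rows,
# while --defer still resolves upstream references to production objects
dbt run -s model_c --empty --defer --state .state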
Row Sampling
This is another creative way to limit the amount of data getting built in development invocation contexts (local or CI/CD). The idea is to limit or filter models with large row counts in development, so transforms against those small data subsets can be fast. For example:
with my_big_ref as (
select * from {{ ref('my_big_table') }}
limit 500
-- OR
-- where updated_at >= DATEADD(week, -1, CURRENT_DATE)
-- etc
)
... more transforms go here ...
Such data sampling techniques can be packaged into a ref macro override, so sampling automatically gets applied depending on the dbt target.
{% macro ref(model_name) %}
{%- set original_ref = builtins.ref(model_name) -%}
{%- if target.name == 'dev' -%}
(
select *
from {{ original_ref }}
limit 1000
)
{%- else -%}
{{ original_ref }}
{%- endif -%}
{% endmacro %}
The same can be done for the source macro. Please be aware that overriding and changing the behavior of builtins can cause incompatibility issues with other macros, for example dbt_utils.star().
It looks like dbt is working on some native features for this, potentially exposed via a --sample CLI flag.
- https://github.com/dbt-labs/dbt-core/issues/8378
- https://github.com/dbt-labs/dbt-core/issues/11226
Personally, I never found this approach to be very friendly from a development experience perspective. Samples need to take into account referential integrity and unevenly distributed data to be useful and reliable when testing, and this is a lot harder than it sounds with a complex DAG. Nonetheless, I thought I’d mention it, especially since dbt might soon provide native support for it.
Wrap Up
This post introduced the core concepts behind slim dbt builds and some hands-on examples for achieving slim builds in a local dbt invocation context. We also set up dbt artifact persistence and reviewed several different dbt “invocation contexts” in which slim build techniques can be applied. Continue with Part 2 of this series, Best Practices for dbt Workflows, Part 2: Slim CI/CD Builds, where we’ll dive into slim CI/CD strategies.