Best Practices for dbt Workflows, Part 3: Slim Scheduled Builds
- Alex Caruso, Lead Data Platform Engineer at Entera
In Best Practices for dbt Workflows, Part 1: Concepts & Slim Local Builds, I introduced the concept of a “slim” dbt build, and provided some examples for local dbt development. I also described various “dbt invocation contexts” (Local, CI/CD, and Scheduled).
In Best Practices for dbt Workflows, Part 2: Slim CI/CD Builds, I described some techniques for achieving slim builds in CI/CD pipelines.
In this post, we’ll look at the final invocation context, Scheduled, and work through some strategies for slim builds. I’ll close out with some discussion of other considerations for dbt builds across all three invocation contexts.
Part 1 Recap: Local dbt Slim Builds
“Slim” dbt builds minimize redundant, unnecessary, or erroneous model invocations. When we invoke dbt, we’re usually trying to achieve one of two things:
- Building, testing, and validating code / logic changes to dbt resources (models, tests, etc) (the Local or CI/CD invocation contexts)
- Refreshing models as new source data comes in (the Scheduled invocation context)
In both cases, we usually only need to build a smaller subset of the DAG. For any given dbt invocation, there is some minimum set of resources that need to be built to achieve the goal. The goal of slim builds is to get as close as possible to that minimum set, so we don’t waste compute time and money on other resources.
Local dbt builds can be “slimmed down” to target only relevant resources using the --defer CLI flag, the --empty CLI flag, or row sampling techniques.
Part 2 Recap: Slim CI/CD
Slim CI/CD builds can be achieved with the state:modified+ selector or the --fail-fast CLI flag. --defer can also be used in some, but not all, CI/CD contexts.
Prerequisites
- You’ve read up through Slim Local Builds in Best Practices for dbt Workflows, Part 1: Concepts & Slim Local Builds
- You’ve set up dbt artifact persistence, as described in Part 1
- You have a functioning orchestration pipeline to invoke scheduled dbt builds against an isolated database or schema in your destination environment
Slim Scheduled Builds
In a scheduled dbt invocation context, there are no code changes occurring, only refreshes to source data. This means the decision about which models to build for a given invocation depends only on the source data refresh frequency and downstream model and exposure SLAs.
source_status:fresher+ Selector
dbt is able to capture the “freshness” of your source data tables using the dbt source freshness command. The “freshness” of a given table describes when it was last updated, typically based on a timestamp column, and the diff between the maximum value of that column and some SLA threshold, e.g. daily, weekly, etc.
If sources are configured as described in the docs linked above (with a freshness and loaded_at_field property), then the dbt source freshness command will generate a sources.json artifact which stores the max_loaded_at value for each source table. This can be used in conjunction with the source_status:fresher+ selector to select only sources which have been updated since the last time this artifact was generated.
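For reference, this config might look something like the following in a sources YAML file. The source, table, and column names here are hypothetical, and the thresholds are only illustrative:

version: 2
sources:
  - name: sourcename
    schema: raw  # hypothetical schema holding the raw tables
    loaded_at_field: _loaded_at  # hypothetical timestamp column written by the loader
    freshness:
      warn_after: {count: 24, period: hour}
      error_after: {count: 48, period: hour}
    tables:
      - name: daily_source
      - name: weekly_source
        freshness:  # table-level override for the slower-moving source
          warn_after: {count: 8, period: day}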
This is useful for “slimming down” scheduled builds to ignore models downstream from sources which haven’t had any updates since the last dbt invocation. For example, suppose you have a large datasource that only updates once weekly, but you also run a scheduled nightly dbt build. Models downstream from this source should be skipped 6 out of every 7 days, unless they also depend on one or more sources which refresh more frequently than weekly.
Example
Suppose you have two sources, one that updates daily and another weekly, and a few downstream models. After running dbt source freshness, your sources.json artifact looks like this:
[
{
"unique_id": "source.projectname.sourcename.daily_source",
"max_loaded_at": "2025-01-03T12:00:00.000000+00:00",
...
},
{
"unique_id": "source.projectname.sourcename.weekly_source",
"max_loaded_at": "2025-01-01T00:00:00.000000+00:00",
...
}
]
The daily_source last updated on January 3 at noon, and the weekly_source last updated on January 1 at midnight. If we run dbt run -s source_status:fresher+ --state .state on January 4th, only the daily source is considered “fresher”, so models downstream of only the weekly source should get skipped, while models downstream of the daily source are rebuilt.
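If you want to sanity check the selection before building anything, dbt ls accepts the same selector, assuming dbt source freshness has just been run and the previous run’s artifacts are in .state:

dbt ls -s source_status:fresher+ --state .state

On January 4th, this should include daily_source and its downstream models, but nothing that depends only on weekly_source.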
Considerations
- When to run dbt source freshness?
  - This command should be run before each new scheduled dbt invocation. The resulting sources.json artifact should be persisted, so subsequent scheduled dbt build invocations can download it to disk and utilize it for the source_status:fresher+ selector (a sketch of this sequencing follows this list).
- Make sure all sources have defined freshness and loaded_at_field config
  - If they don’t, dbt will ignore these sources during the dbt source freshness command, and not generate any max_loaded_at timestamp metadata in the sources.json artifact. These sources will then be completely ignored by the source_status:fresher+ selector.
  - It’s a bit tedious to ensure all sources have this config - my team enforces it using a macro that inspects all sources and asserts that these attributes are defined.
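As a sketch of how this sequencing might look, here’s a hypothetical scheduled job written in GitHub Actions-style YAML. The cron schedule, artifact bucket, and adapter are assumptions for illustration only, not part of any particular setup:

name: scheduled-dbt-build
on:
  schedule:
    - cron: "0 6 * * *"  # hypothetical nightly schedule

jobs:
  slim_scheduled_build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dbt and project dependencies  # assumed Snowflake adapter
        run: pip install dbt-snowflake && dbt deps
      - name: Download the previous run's artifacts into .state/
        run: aws s3 cp s3://my-dbt-artifacts/ .state/ --recursive  # hypothetical artifact store
      - name: Capture current source freshness (writes target/sources.json)
        run: dbt source freshness
      - name: Build only models downstream of sources with new data
        run: dbt build --select source_status:fresher+ --state .state
      - name: Persist this run's artifacts for the next scheduled invocation
        run: aws s3 cp target/ s3://my-dbt-artifacts/ --recursive  # hypothetical artifact store

The exact persistence mechanism doesn’t matter; what matters is that each run can read the previous run’s sources.json (and manifest.json) from a known location before invoking dbt.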
Model Tagging
Another way to achieve slim scheduled builds is with model tagging. Rather than running all models on every scheduled build, we can run only models or subgraphs associated with a tag. A convenient way to do this in a scheduled invocation context is to use tags named after some refresh SLA, like daily, weekly, or monthly.
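Tags like these can be applied model by model, or at the folder level in dbt_project.yml. A minimal sketch, with hypothetical project and folder names:

models:
  my_project:
    marts:
      daily_reports:
        +tags: daily
      weekly_reports:
        +tags: weekly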
For example, suppose we have some models serving reports with a weekly refresh SLA. They need to be fresh at the start of the business day on Monday mornings. Even though the refresh SLA is weekly, the underlying source data actually updates daily.
We can’t use the source_status:fresher+ selector to skip these models, because their sources refresh more frequently than necessary for the downstream model SLA.
Example
Consider the following dbt build command:
dbt build -s tag:weekly+
This will run all models tagged as weekly and their downstream children. Downstreams must also be run to ensure any models that depend on both weekly data and daily data, for example, are also refreshed. If we’re running weekly models as a dedicated subgraph, we should also exclude these models from daily builds, so they don’t get repeatedly re-built unnecessarily.
dbt build -s tag:daily+ --exclude tag:weekly
Note: you don’t necessarily need to tag every model in your project with a refresh frequency. You can just leave your “baseline” models, which typically run most frequently (say daily), untagged and build them without any tag selector. In the above example, I used tag:daily for clarity, but if this selector is in place, the --exclude tag:weekly isn’t actually needed. However, if “daily” is your typical baseline build frequency, and your “daily” models are untagged, then an explicit exclusion of weekly models is needed.
Considerations
- You might be wondering “why can’t I just use an incremental materialization for this?” That is one potential solution, but it’s not as general purpose as using tags. Some models might be table materializations and too complex to easily incrementalize. They may also have non-deterministic logic that makes incrementalization impossible.
- Don’t conflate source refresh frequency with model refresh SLAs
  - Just because a source only refreshes weekly does not mean its downstream models should only run weekly too. A cleaner solution is to attempt to run them every day, using the source_status:fresher+ selector or an incremental materialization. This way they are guaranteed to be updated upon the first dbt invocation after a source refresh. Otherwise, there might be a multi-day “lag” between when a weekly source updates and when the corresponding weekly models build, which is not ideal.
- Overlapping model selectors and redundant model rebuilds
  - You might notice that a downstream untagged model, one that depends on both the daily and weekly subgraphs, still gets selected and built under both dbt invocation commands above. This is redundant!
  - Don’t worry, despite there being some redundancy here, this is still a major improvement over running weekly tagged models every day. The tradeoff is we must redundantly re-build the downstream untagged model once per week (for the daily build, then again for the weekly). Of course, this cost-benefit analysis depends on the relative costs of the untagged model vs the weekly model.
  - This can be improved further with more sophisticated selector patterns and DAG architectures, but I won’t get into that here.
Wrap Up: Putting it All Together
It’s possible to compose several of the CLI flags and state-based selectors discussed in this series to achieve ultra slim builds in the CI/CD or Scheduled invocation contexts.
CI/CD Context
“Build all modified models and their downstream children, and defer upstream references to the production DBT_PROD db. Fail fast when an error is encountered.”
dbt build --fail-fast --defer --select state:modified+ --state .state
Scheduled Context
“Build all models downstream from sources with record updates relative to the previous build, if those models are tagged as daily. Defer upstream references to the production DBT_PROD db, and fail fast when an error is encountered.”
dbt build --fail-fast --defer --select source_status:fresher+,tag:daily --state .state
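If these composed selections start getting long, they can also be captured as named YAML selectors in selectors.yml and invoked with --selector. Here’s a sketch of the scheduled selection above (the selector name is made up, and previous artifacts are assumed to be in .state):

selectors:
  - name: scheduled_daily
    description: Daily-tagged models downstream of sources with new data
    definition:
      intersection:
        - method: source_status
          value: fresher
          children: true
        - method: tag
          value: daily

which would then be invoked as dbt build --fail-fast --defer --selector scheduled_daily --state .state.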
Other Considerations
Object Clones vs Deferred References via --defer
In both Part 1 and Part 2 of this series, I covered the idea that some of the benefits of the --defer CLI flag can actually be achieved with zero copy cloned objects. Instead of working from an empty invocation database and deferring upstream references to a production database, those objects can be cloned into the invocation db before invoking dbt and referenced normally.
This strategy offers more protection against race conditions, because the “critical” section during which the production objects are “snapshotted” (cloned) is much smaller than with references via --defer. If deferring references to another db, you run the risk of those objects in the deferral db changing midway through your build due to an out of band deployment. Clones can be a better strategy for “critical” dbt invocations, such as invocations associated with main branch CI/CD builds (merge / deploy).
The cloning strategy has some downsides too, namely the fact that clones must be recreated every time source objects change, and complications with RBAC in Snowflake.
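As an aside, newer dbt versions (1.6+) also ship a dbt clone command that can create these clones directly from a state manifest, e.g. dbt clone --state .state, using the warehouse’s zero copy cloning where it’s supported.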
Materialization Strategies
Materialization strategy is another critical component of slim builds. This is a whole topic in itself, so I won’t go into detail here. With that said, teams should be aware that excessive full refresh builds, particularly in CI/CD contexts, are a major source of redundant model builds and spend. Usage of the --full-refresh flag, as well as decisions between incremental and table materializations, should be carefully deliberated in each invocation context.
Wrap Up
There are seemingly endless strategies to invoke and deploy dbt models. While this flexibility is great to have, it also puts the responsibility on developers to make sure models are not excessively built and rebuilt. This is especially important when data warehouse solutions like Snowflake make it really easy to go overboard on compute spend. Hopefully the techniques reviewed in this post leave you more prepared to manage your dbt invocations and keep your builds slim moving forward.