In Best Practices for dbt Workflows, Part 1: Concepts & Slim Local Builds, I introduced the concept of a “slim” dbt build, and provided some examples for local dbt development. I also described various “dbt invocation contexts” (Local, CI/CD, and Scheduled).
This post is a followup to Part 1, where we’ll look specifically at strategies for slim dbt builds in a CI/CD invocation context (the middle section in the image above).
“Slim” dbt builds minimize redundant, unnecessary, or erroneous model invocations. When we invoke dbt, we’re usually trying to achieve one of two things:

- Validate a set of changes (Local or CI/CD invocation context)
- Refresh production data (Scheduled invocation context)

In both cases, we usually only need to build a smaller subset of the DAG. For any given dbt invocation, there is some minimum set of resources that must be built to achieve the goal. The aim of slim builds is to get as close as possible to that minimum set, so we don’t waste compute time and money on other resources.
Local dbt builds can be “slimmed down” to target only relevant resources using the `--defer` CLI flag, the `--empty` CLI flag, or row sampling techniques.
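As a quick recap, an empty local build might look like the following (`my_model` is a placeholder model name):

```shell
# Validate a model's SQL without scanning any source rows:
# --empty compiles and runs the model with zero-row refs/sources
dbt run --select my_model --empty
```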
(For details, see Slim Local Builds in Best Practices for dbt Workflows, Part 1: Concepts & Slim Local Builds.)

The most powerful tool for slim CI/CD builds is the `state:modified` node selector. This selector allows you to run only the models which have been modified relative to some previous dbt DAG state, as described by a manifest artifact. Typically, this manifest will be the latest production manifest.
Suppose your project has 1000 models, and you open a Pull Request which modifies just 2 of those models. In order to test and validate your changes, you might want to build just those 2 models and their downstream dependencies, instead of the whole DAG. This can be done with the following command:

```shell
dbt run --select state:modified+ --state .state
```
dbt will be able to dynamically determine which resources have been modified in the current changeset, relative to the latest production manifest.
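In a CI pipeline, this typically means downloading the latest production manifest before invoking dbt. A minimal sketch, assuming the manifest is stored as a build artifact (the S3 bucket and path here are purely illustrative; the storage mechanism will vary by platform):

```shell
# Fetch the latest production manifest into a local state directory
mkdir -p .state
aws s3 cp s3://my-artifacts-bucket/prod/manifest.json .state/manifest.json

# Build only modified models, plus everything downstream of them
dbt run --select state:modified+ --state .state
```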
dbt now supports some more specific `state:modified` selectors, such as `state:modified.body`, `state:modified.configs`, etc. Check out the docs for detailed info on these sub-selectors.
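For example, to narrow the selection to models whose SQL body changed, ignoring pure configuration changes (this is my reading of the sub-selector docs; check the docs for your dbt version’s exact behavior):

```shell
# List models whose compiled SQL body has changed
dbt ls --select state:modified.body --state .state

# List models whose configs (e.g. materialization) have changed
dbt ls --select state:modified.configs --state .state
```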
The `--fail-fast` CLI flag forces dbt invocations to exit immediately with a `1` exit code as soon as any error is encountered. By default, dbt won’t do this. Instead, it will continue running all other resources which are unaffected by the error (for example, unconnected subgraphs) until the job completes.

This default is not always desirable in a CI/CD context: those extra resource invocations can be costly, and it takes longer to receive feedback about the ultimate status of the build.
The only downside of this approach is that if your project has other errors downstream of the failed resource, you won’t find out about them until you resolve the first error which triggered the fail-fast and run another dbt invocation. This means you’ll need repeated dbt invocations to resolve multiple errors. While this is tedious, it can still be far more cost-efficient than letting all resources build every time. Typically, I disable the `--fail-fast` flag if I have a large changeset which modifies multiple unrelated subgraphs; in these situations, I want to learn about all errors at once, so I can resolve them together without multiple subsequent invocations.
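Enabling fail-fast is just a matter of adding the flag (shown here combined with the state selector from earlier):

```shell
# Stop the entire invocation as soon as any resource fails
dbt run --select state:modified+ --state .state --fail-fast

# Equivalent, using the short forms of --select and --fail-fast
dbt run -s state:modified+ --state .state -x
```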
In Part 1, I presented the `--defer` CLI flag as an option for slimming down local dbt invocations:

```shell
dbt run -s my_model --defer --state .state
```
This can also potentially be used in a CI/CD invocation context, but there are some additional considerations, particularly for workflows that deploy models to production environments.
My team runs two types of dbt CI/CD pipelines:

1. “PR” builds, which validate a Pull Request’s changes without promoting the resulting objects to production.
2. “Main” branch builds, which build changes into a dedicated invocation db (`DBT_CI_MAIN`) and promote them to production in a blue-green swap style.

In a blue-green production deployment, the newly built objects contained in the invocation db (`DBT_CI_MAIN` in the above diagram) must be “promoted” to a production db after the dbt invocation completes successfully. The most straightforward way to do this is to `swap` the entire database or schema. This assumes the invocation db was created from a `clone` before dbt was invoked.
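Assuming Snowflake (where `CLONE` and `SWAP` are zero-copy metadata operations), the full-database flow might be sketched as follows; the database names and the `ci_main` profile target are illustrative:

```shell
# 1. Create the invocation db as a zero-copy clone of production
snowsql -q "CREATE OR REPLACE DATABASE DBT_CI_MAIN CLONE ANALYTICS_PROD;"

# 2. Build only modified models into the invocation db
dbt run --select state:modified+ --state .state --target ci_main

# 3. On success, promote everything at once with an atomic swap
snowsql -q "ALTER DATABASE DBT_CI_MAIN SWAP WITH ANALYTICS_PROD;"
```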
Another option is to start from an empty db and defer upstream object references to the production db. However, if the invocation db starts from empty, then a full database or schema swap will result in lots of missing objects in production. This means low-level objects (tables, views, etc.) must be promoted individually via `create or replace ... clone` or `swap` statements.
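Object-level promotion, by contrast, requires one statement per built object. For example (database, schema, and table names are illustrative):

```shell
# Promote a single table from the invocation db to production;
# this must be repeated (or scripted) for every object dbt built
snowsql -q "CREATE OR REPLACE TABLE ANALYTICS_PROD.MARTS.ORDERS
            CLONE DBT_CI_MAIN.MARTS.ORDERS;"
```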
While this is doable, it is also somewhat complex to implement at scale. For this reason, the full database cloning approach is probably most suitable for scheduled builds or `main` branch CI/CD deployments. `--defer` is most suitable for local invocations, and can be used for some, but not all, CI/CD invocations (namely, “PR” builds, where the resulting objects don’t need to be promoted to production).
This post reviewed three techniques for slim builds in CI/CD pipelines: the `state:modified` selector, the `--fail-fast` CLI flag, and the `--defer` CLI flag. Each of these techniques can reduce the set of models that must be built as part of dbt CI/CD pipelines.
We also looked at some possible CI/CD architectures, and the pros and cons of using `--defer` vs object clones for upstream model references in CI/CD pipelines.
Continue with Part 3 of this series, Best Practices for dbt Workflows, Part 3: Slim Scheduled Builds, where we’ll dive into strategies for slim Scheduled builds.