Best Practices for dbt Workflows, Part 2: Slim CI/CD Builds
- Alex Caruso, Lead Data Platform Engineer at Entera
In Best Practices for dbt Workflows, Part 1: Concepts & Slim Local Builds, I introduced the concept of a “slim” dbt build, and provided some examples for local dbt development. I also described various “dbt invocation contexts” (Local, CI/CD, and Scheduled).
This post is a follow-up to Part 1. Here we’ll look specifically at strategies for slim dbt builds in a CI/CD invocation context (the middle section in the image above).
Part 1 Recap: Local dbt Slim Builds
“Slim” dbt builds minimize redundant, unnecessary, or erroneous model invocations. When we invoke dbt, we’re usually trying to achieve one of two things:
- Building, testing, and validating code / logic changes to dbt resources (models, tests, etc.)
  - Local or CI/CD invocation context
- Refreshing models as new source data comes in
  - Scheduled invocation context
In both cases, we usually only need to build a smaller subset of the DAG. For any given dbt invocation, there is some minimum set of resources that need to be built to achieve the goal. The goal of slim builds is to get as close as possible to that minimum set, so we don’t waste compute time and money on other resources.
Local dbt builds can be “slimmed down” to target only relevant resources using the --defer CLI flag, the --empty CLI flag, or row sampling techniques.
Prerequisites
- You’ve read up until Slim Local Builds in Best Practices for dbt Workflows, Part 1: Concepts & Slim Local Builds
- You’ve set up dbt artifact persistence, as described in Part 1
- You have a functioning CI/CD pipeline to invoke dbt against an isolated database or schema in your destination environment
Slim CI/CD Builds
state:modified Selector
The most powerful tool for slim CI/CD builds is the state:modified node selector. This selector allows you to run only the models which have been modified relative to some previous dbt DAG state, as described by a manifest artifact. Typically, this manifest will be the latest production manifest.
Suppose your project has 1000 models, and you open a Pull Request which modifies just 2 of those models. To test and validate your changes, you might want to build just those 2 models and their downstream dependencies, instead of the whole DAG. This can be done with the following command:
dbt run --select state:modified+ --state .state
dbt will be able to dynamically determine which resources have been modified in the current changeset, relative to the latest production manifest.
dbt now supports some more specific state:modified selectors, such as state:modified.body, state:modified.config, etc. Check out the docs for detailed info on these sub-selectors.
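As a rough sketch of how a CI step might wire up the state:modified selector: the function below picks the dbt arguments depending on whether a production manifest is present in the state directory. The fallback to a plain full run when no manifest exists (for example, on a project's very first deployment) is my own assumption, as is the `STATE_DIR` variable. The snippet only prints the command rather than executing it.

```shell
#!/usr/bin/env bash
# Sketch only: STATE_DIR and the full-run fallback are assumptions,
# not something dbt does for you.
STATE_DIR="${STATE_DIR:-.state}"

# Print the dbt arguments for a slim CI run.
slim_ci_args() {
  local state_dir="$1"
  if [ -f "${state_dir}/manifest.json" ]; then
    # A previous manifest exists, so dbt can compare against it
    echo "run --select state:modified+ --state ${state_dir}"
  else
    # Nothing to compare against: build everything
    echo "run"
  fi
}

# Show the command a CI step would execute
echo "dbt $(slim_ci_args "$STATE_DIR")"
```

In a real pipeline, the step before this one would download the latest production manifest into the state directory.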
--fail-fast CLI Flag For Early Exits & Fast Feedback
The --fail-fast CLI flag forces dbt invocations to exit immediately with a 1 exit code as soon as any error is encountered. By default, dbt won’t do this. Instead, it will continue running all other resources which are unaffected by the error, for example unconnected subgraphs, until completion of the job.
This is not always desirable in a CI/CD context, because those extra resource invocations can be costly, plus it will take longer for you to receive feedback about the ultimate status of the build.
The only downside of this approach is that if your project has other errors downstream from the failed resource, you will not find out about them until you resolve the first error which triggered the fail fast, and run another dbt invocation. This means you’ll need repeated dbt invocations to resolve multiple errors. While this is tedious, it can still be a lot more cost efficient than allowing all resources to build every time. Typically I disable the --fail-fast flag if I have a large changeset which modifies multiple unrelated subgraphs; in these situations, I want to learn about all errors at once, so I can resolve them all together without multiple subsequent invocations.
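One way to make that choice explicit in a pipeline is a small toggle, sketched below. The "true"/"false" argument is a hypothetical convention of this example (it could be driven by a CI variable or a PR label), not a dbt feature. The commands are printed rather than executed.

```shell
# Sketch only: the true/false toggle is a hypothetical convention.
# Prints --fail-fast when enabled (the default), nothing otherwise.
fail_fast_flag() {
  local enabled="${1:-true}"
  if [ "$enabled" = "true" ]; then
    echo "--fail-fast"
  fi
}

# Default: exit on the first error
echo "dbt run --select state:modified+ --state .state $(fail_fast_flag true)"

# Large changeset touching unrelated subgraphs: surface every error in one run
echo "dbt run --select state:modified+ --state .state $(fail_fast_flag false)"
```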
--defer in CI/CD Pipelines and Considerations for Blue-Green Architecture
In Part 1, I presented the --defer CLI flag as an option for slimming down local dbt invocations.
dbt run -s my_model --defer --state .state
This can also potentially be used in a CI/CD invocation context, but there are some additional considerations, particularly for workflows that deploy models to production environments.
My team runs two types of dbt CI/CD pipelines:
- “PR” builds
  - These are associated with an individual Pull Request and build resources in an isolated database or schema, named after the Git branch where the changeset is tracked.
  - These are like a “lite” build that developers may run multiple times before ultimately merging and promoting changes to production via a separate “Main build”. The “PR” database gets discarded when developers are done testing their changes.
- “Main” builds
  - These build resources in a “staging” db (e.g. DBT_CI_MAIN) and promote them to production in a blue-green swap style.
  - These are typically more aggressive with full refreshes and validations in order to ensure high data quality before models make it to production.
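To illustrate the “named after the Git branch” convention for PR builds, here is a minimal sketch that turns a branch name into a database-safe identifier. The `DBT_CI_PR_` prefix and the sanitizing rules (upper-case, anything outside letters, digits, and underscores collapsed to a single underscore) are assumptions for this example; your warehouse may have different naming constraints.

```shell
# Sketch only: the DBT_CI_PR_ prefix and sanitizing rules are assumptions.
pr_database_name() {
  local branch="$1"
  local cleaned
  # Upper-case the branch name, then replace each run of characters
  # outside A-Z, 0-9 and _ with a single underscore
  cleaned="$(printf '%s' "$branch" | tr '[:lower:]' '[:upper:]' | tr -cs 'A-Z0-9_' '_')"
  echo "DBT_CI_PR_${cleaned}"
}

pr_database_name "feature/add-orders"
# prints DBT_CI_PR_FEATURE_ADD_ORDERS
```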
In a blue-green production deployment, the newly built objects contained in the invocation db (DBT_CI_MAIN in the above diagram) must be “promoted” to a production db after the dbt invocation completes successfully. The most straightforward way to do this is to swap the entire database or schema. This assumes the invocation db gets created from a clone before dbt is invoked.
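The clone-then-swap flow can be sketched as two SQL statements around the dbt invocation. This example assumes a Snowflake-style warehouse (where `clone` and `swap with` are available at the database level), assumes the clone is taken from production, and uses `PROD` as a hypothetical production database name. It only prints the statements; running them is left to whatever SQL client your pipeline uses.

```shell
# Sketch only: assumes Snowflake-style SQL; PROD is a hypothetical db name.
# Prints the SQL for one phase of a clone-and-swap deployment.
blue_green_sql() {
  local phase="$1" staging_db="$2" prod_db="$3"
  case "$phase" in
    prepare) echo "create or replace database ${staging_db} clone ${prod_db};" ;;
    promote) echo "alter database ${staging_db} swap with ${prod_db};" ;;
    *) echo "unknown phase: ${phase}" >&2; return 1 ;;
  esac
}

blue_green_sql prepare DBT_CI_MAIN PROD
# ... dbt is invoked against DBT_CI_MAIN here ...
blue_green_sql promote DBT_CI_MAIN PROD
```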
Another option is to start from an empty db and defer upstream object references to the production db. However, if the invocation db starts from empty, then a full database or schema swap will result in lots of missing objects in production. This means low level objects (tables, views, etc) must be promoted individually via create or replace ... clone or swap statements.
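To give a feel for the per-object approach, the sketch below generates one clone statement per table. It assumes Snowflake-style SQL, covers tables only (other object types such as views would need their own handling), and uses hypothetical schema, table, and database names. How you obtain the list of objects built by the invocation is left open; here it is simply read from stdin.

```shell
# Sketch only: assumes Snowflake-style SQL and handles tables only.
# Reads schema-qualified table names from stdin and prints one clone
# statement per table, copying from source_db into target_db.
promote_tables_sql() {
  local source_db="$1" target_db="$2" table
  while IFS= read -r table; do
    # Skip blank lines
    [ -n "$table" ] || continue
    echo "create or replace table ${target_db}.${table} clone ${source_db}.${table};"
  done
}

# Hypothetical table names, for illustration
printf '%s\n' "analytics.orders" "analytics.customers" \
  | promote_tables_sql DBT_CI_MAIN PROD
```

Even in this simplified form, the pipeline has to track exactly which objects were built and handle each object type, which is where the complexity at scale comes from.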
While this is doable, it is also somewhat complex to implement at scale. For this reason, the full database cloning approach is probably most suitable for scheduled builds or main branch CI/CD deployments. --defer is most suitable for local invocations, and can be used for some, but not all, CI/CD invocations (namely, “PR” builds, where the resulting objects don’t need to be promoted to production).
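Putting that recommendation into a pipeline might look like the sketch below, which picks dbt arguments by build type. The `pr` / `main` labels and the exact argument sets are assumptions for illustration: PR builds defer upstream references to production, while Main builds run against a cloned db and so don't need --defer. The commands are printed rather than executed.

```shell
# Sketch only: the pr/main labels and argument sets are assumptions.
# Prints the dbt arguments for a given CI build type.
ci_dbt_args() {
  local build_type="$1" state_dir="${2:-.state}"
  case "$build_type" in
    # PR build: resolve unbuilt upstream models against production
    pr) echo "run --select state:modified+ --defer --state ${state_dir}" ;;
    # Main build: the invocation db is a clone, so upstream objects already exist
    main) echo "run --select state:modified+ --state ${state_dir}" ;;
    *) echo "unknown build type: ${build_type}" >&2; return 1 ;;
  esac
}

echo "dbt $(ci_dbt_args pr)"
echo "dbt $(ci_dbt_args main)"
```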
Wrap Up
This post reviewed three techniques for slim builds in CI/CD pipelines - the state:modified selector, the --fail-fast CLI flag, and the --defer CLI flag. Each of these techniques can reduce the set of models that must be built as part of dbt CI/CD pipelines.
We also looked at some possible CI/CD architectures, and the pros and cons of using --defer vs object clones for upstream model references in CI/CD pipelines.
Continue with Part 3 of this series, Best Practices for dbt Workflows, Part 3: Slim Scheduled Builds, where we’ll dive into strategies for slim Scheduled builds.