Best Practices for dbt Workflows, Part 2: Slim CI/CD Builds

  • Alex Caruso
    Lead Data Platform Engineer at Entera

In Best Practices for dbt Workflows, Part 1: Concepts & Slim Local Builds, I introduced the concept of a “slim” dbt build, and provided some examples for local dbt development. I also described various “dbt invocation contexts” (Local, CI/CD, and Scheduled).

SELECT best practices for dbt workflows

This post is a follow-up to Part 1, where we’ll look specifically at strategies for slim dbt builds in a CI/CD invocation context (the middle section in the image above).

Part 1 Recap: Local dbt Slim Builds

“Slim” dbt builds minimize redundant, unnecessary, or erroneous model invocations. When we invoke dbt, we’re usually trying to achieve one of two things:

  1. Building, testing, and validating code / logic changes to dbt resources (models, tests, etc)
    1. Local or CI/CD invocation context
  2. Refreshing models as new source data comes in
    1. Scheduled invocation context

In both cases, we usually only need to build a smaller subset of the DAG. For any given dbt invocation, there is some minimum set of resources that need to be built to achieve the goal. The goal of slim builds is to get as close as possible to that minimum set, so we don’t waste compute time and money on other resources.

Local dbt builds can be “slimmed down” to target only relevant resources using the --defer CLI flag, the --empty CLI flag, or row sampling techniques.

Prerequisites

  • You’ve read up until Slim Local Builds in Best Practices for dbt Workflows, Part 1: Concepts & Slim Local Builds
  • You’ve set up dbt artifact persistence, as described in Part 1
  • You have a functioning CI/CD pipeline to invoke dbt against an isolated database or schema in your destination environment

Slim CI/CD Builds

state:modified Selector

The most powerful tool for slim CI/CD builds is the state:modified node selector. This selector allows you to run only the models which have been modified relative to some previous dbt DAG state, as described by a manifest artifact. Typically, this manifest will be the latest production manifest.

Suppose your project has 1000 models, and you open a Pull Request which modifies just 2 of those models. To test and validate your changes, you might want to build just those 2 models and their downstream dependencies, instead of the whole DAG. This can be done with the following command:

dbt run --select state:modified+ --state .state

dbt will be able to dynamically determine which resources have been modified in the current changeset, relative to the latest production manifest.

dbt now supports some more specific state:modified selectors, such as state:modified.body, state:modified.config, etc. Check out the docs for detailed info on these sub-selectors.
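In a CI pipeline, the production manifest is typically downloaded from wherever it was persisted (as set up in Part 1) before invoking dbt. Here’s a sketch of what that step might look like, assuming GitHub Actions and an S3 bucket for artifact storage (the bucket name and paths are placeholders):

```yaml
# Hypothetical CI step; adapt the bucket name and paths to your setup
- name: Slim dbt build
  run: |
    # fetch the latest production manifest persisted by a prior build
    aws s3 cp s3://my-dbt-artifacts/prod/manifest.json .state/manifest.json
    # build only modified resources and their downstream dependencies
    dbt build --select state:modified+ --state .state
```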

--fail-fast CLI Flag For Early Exits & Fast Feedback

The --fail-fast CLI flag forces a dbt invocation to exit immediately with exit code 1 as soon as any error is encountered. By default, dbt won’t do this; instead, it continues running all other resources unaffected by the error (for example, unconnected subgraphs) until the job completes.

This is not always desirable in a CI/CD context, because those extra resource invocations can be costly, plus it will take longer for you to receive feedback about the ultimate status of the build.


The only downside of this approach is that if your project has other errors downstream of the failed resource, you won’t find out about them until you resolve the first error that triggered the fail-fast and run another dbt invocation. In other words, resolving multiple errors takes repeated invocations. While this is tedious, it can still be far more cost-efficient than letting every resource build each time. Typically I disable the --fail-fast flag when I have a large changeset which modifies multiple unrelated subgraphs; in those situations, I want to learn about all errors at once, so I can resolve them together without multiple follow-up invocations.
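--fail-fast composes naturally with the state:modified selector from the previous section. A typical slim CI invocation might look like this (a sketch, assuming artifact persistence as described earlier):

```shell
# build only modified resources and their children, exiting on the first error
dbt build --select state:modified+ --state .state --fail-fast
```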

--defer in CI/CD Pipelines and Considerations for Blue-Green Architecture

In Part 1, I presented the --defer CLI flag as an option for slimming down local dbt invocations.

dbt run -s my_model --defer --state .state

This can also potentially be used in a CI/CD invocation context, but there are some additional considerations, particularly for workflows that deploy models to production environments.

My team runs two types of dbt CI/CD pipelines:

  1. “PR” builds
    1. These are associated with an individual Pull Request and build resources in an isolated database or schema, named after the Git branch where the changeset is tracked.
    2. These are like a “lite” build that developers may run multiple times before ultimately merging and promoting changes to production via a separate “Main build”. The “PR” database gets discarded when developers are done testing their changes.
  2. “Main” builds
    1. These build resources in a “staging” db (e.g. DBT_CI_MAIN) and promote them to production in a blue-green swap style.
    2. These are typically more aggressive with full refreshes and validations in order to ensure high data quality before models make it to production.
SELECT best practices for dbt workflows

In a blue-green production deployment, the newly built objects in the invocation db (DBT_CI_MAIN in the above diagram) must be “promoted” to a production db after the dbt invocation completes successfully. The most straightforward way to do this is to swap the entire database or schema, assuming the invocation db was created as a clone of production before dbt was invoked.
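As a sketch, on Snowflake the clone-and-swap flow might look like the following (database names are illustrative):

```sql
-- 1. Before invoking dbt: create the invocation db as a clone of production
CREATE OR REPLACE DATABASE DBT_CI_MAIN CLONE ANALYTICS_PROD;

-- 2. dbt builds the modified models inside DBT_CI_MAIN

-- 3. After a successful build: atomically swap the invocation db into production
ALTER DATABASE DBT_CI_MAIN SWAP WITH ANALYTICS_PROD;
```

Because the clone is zero-copy and the swap is atomic, production always sees either the old state or the complete new state, never a partial build.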

Another option is to start from an empty db and defer upstream object references to the production db. However, if the invocation db starts empty, a full database or schema swap would leave production with many missing objects. This means low-level objects (tables, views, etc.) must be promoted individually via create or replace ... clone or swap statements.
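A sketch of per-object promotion on Snowflake, with illustrative names (a real implementation would iterate over the objects dbt actually built):

```sql
-- promote a newly built table into production by cloning it across databases
CREATE OR REPLACE TABLE ANALYTICS_PROD.MARTS.ORDERS
  CLONE DBT_CI_MAIN.MARTS.ORDERS;

-- or, if the object already exists in both databases, exchange them atomically
ALTER TABLE DBT_CI_MAIN.MARTS.ORDERS SWAP WITH ANALYTICS_PROD.MARTS.ORDERS;
```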

While this is doable, it is also somewhat complex to implement at scale. For this reason, the full database cloning approach is probably most suitable for scheduled builds or main branch CI/CD deployments. --defer is most suitable for local invocations, and can be used for some, but not all, CI/CD invocations (namely, “PR” builds, where the resulting objects don’t need to be promoted to production).
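For a “PR” build along these lines, a sketch of the invocation might combine state selection with deferral, so unmodified upstream references resolve to production instead of being rebuilt in the isolated PR schema:

```shell
# build only modified resources in the isolated PR schema, deferring
# unmodified upstream references to the production manifest
dbt build --select state:modified+ --defer --state .state
```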

Wrap Up

This post reviewed three techniques for slim builds in CI/CD pipelines: the state:modified selector, the --fail-fast CLI flag, and the --defer CLI flag. Each of these techniques can reduce the set of models that must be built as part of dbt CI/CD pipelines.

We also looked at some possible CI/CD architectures, and the pros and cons of using --defer vs object clones for upstream model references in CI/CD pipelines.

Continue with Part 3 of this series, Best Practices for dbt Workflows, Part 3: Slim Scheduled Builds, where we’ll dive into strategies for slim Scheduled builds.

Alex Caruso
Lead Data Platform Engineer at Entera
Alex is a Lead Data Platform Engineer at Entera, based out of New York, United States.
