Terraform for Streamlined Snowflake Management (2024)
- Gary James, Senior Analytics Engineer at Beauty Pie
What is Terraform?
Terraform is an Infrastructure as Code (IaC - no GUI fiddling!) tool for managing the configuration of your services via the use of providers (usually service-specific - Snowflake, AWS, dbt Cloud, etc.).
Providers are to Terraform what packages are to Python: pre-bundled bits of reusable interfaces built for specific purposes. Continuing this analogy, the Terraform Registry would be a bit like PyPI.
Why should you use Terraform for managing Snowflake Infrastructure?
There are three compelling reasons you should consider using Terraform to manage your Snowflake infrastructure.
1. To version control your configuration
With Terraform, your infrastructure configuration is stored in a version-controlled repository like Github. This means you get the usual benefits of storing code in Github: an easy way to roll back, preview changes, a long-term code history, and those sweet, sweet reviews from your team members before you release any changes.
Github aside, Terraform itself maintains its own state of your infrastructure’s resources, helping to safeguard against unexpected changes by providing a diff during each deployment.
2. Scale
Terraform enables configuration reusability, allowing you to easily scale your Snowflake deployments. At the highest level, that means being able to manage multiple Snowflake accounts from one location. Beyond that, the ability to create/update/remove resources with common, templated configurations via loops or modules can really cut down management time for those broad tasks.
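As a rough illustration of the looped, templated approach (the names and sizes in this sketch are hypothetical, not from any particular setup):

# Hypothetical sketch: one templated warehouse definition looped over a map,
# rather than hand-writing a resource block per team.
locals {
  team_warehouses = {
    analytics = "x-small"
    marketing = "small"
  }
}

resource "snowflake_warehouse" "team" {
  for_each       = local.team_warehouses
  name           = upper("${each.key}_wh")
  warehouse_size = each.value
  auto_suspend   = 60
}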
It also helps you scale internal collaboration and best practices. It provides a common language across repositories, teams, and even disciplines, which is super helpful for any data team these days! Common tooling across an organisation can also cut down on time spent maintaining bespoke deployment pipelines, if you're all running from the same playbook.
3. An easier way to manage dependencies
Terraform builds a graph of dependencies between your resources using the objects you define, which means it can call you out for trying to remove resources that are needed downstream, like dropping a table used in a stream.
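For instance, in the minimal sketch below (names are illustrative), the schema references the database resource, so Terraform knows the database must exist first and will flag an attempt to destroy it while the schema still depends on it.

resource "snowflake_database" "analytics" {
  name = "ANALYTICS"
}

resource "snowflake_schema" "staging" {
  # Referencing the database resource is what creates the edge in Terraform's dependency graph
  database = snowflake_database.analytics.name
  name     = "STAGING"
}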
Is Terraform (or any IaC) right for you?
Even in the sketchiest of implementations, IaC tools will always do one thing, and that's get you to write your setup down! Who doesn't love a good bit of documentation, eh? And if that just so happens to be in version control, then it's likely there's at least some mention of why or when that config was added, all useful context.
The other side of the coin is that IaC is by its very nature a step removed from the platform you wish to interact with, and the more you lean into techniques like modules, the further removed your configuration can become. That can add obscurity for others trying to use it, or worse, introduce brittleness to your stack, so abstractions should be made thoughtfully and strategically.
Is the configuration of this service really important? Is the juice worth the squeeze? 🍊
Every aspect of your infrastructure has associated risks, such as misconfiguration and downtime. Moving your configurations (existing or new) to IaC is not without investment, so consider how vital this component is to your business before diving in.
Can you invest the time to set it up and maintain it? Is Snowflake core enough to the business to warrant it? Is recovery time of Snowflake a factor? Do you need to factor in moving existing Snowflake resources to Terraform?
Do those administrating this service prefer it in code?
Sometimes your admins aren't going to want to (or be able to!) go through code and version control each time a change is needed. If the consequences of messing up are tolerable, don't hinder flexibility!
That said, Terraform (and other IaC tools) can provide a common language across development teams, adding a level of consistency and reusability (see Terraform Modules) to your org and facilitating a democratised approach to administration, sharing what can be a burden on senior employees.
Does the provider cover your needs? Is there even a provider?
Not everything has an available provider in Terraform, and even where one exists, there can be a delay between new functionality being released and it being reflected in the provider. It's worth verifying what's available before starting with a provider, or going down the path of building your own.
How to use Terraform with Snowflake
Whether you’re developing locally, in Codespaces, or starting a production setup, the initial steps are fairly similar, so let’s introduce some of the base concepts for using any Terraform provider.
The Snowflake Terraform Provider
Checking the Terraform Registry, you may find various Snowflake related providers and modules available, but the provider we’ll focus on is the one now under the maintenance of Snowflake themselves: the official Snowflake Terraform Registry.
Getting Set Up
To begin a Snowflake Terraform project:
- Install Terraform - see the docs, e.g. via Homebrew on macOS:
brew tap hashicorp/tap
brew install hashicorp/tap/terraform
terraform -help
- Create a new directory/ repository
mkdir snowflake-terraform-example
cd snowflake-terraform-example
- Create a file, versions.tf - this will contain the versions of the Terraform providers you will use. For this example, it should look something like:
terraform {
  required_version = ">= 1.0.0"

  required_providers {
    snowflake = {
      source  = "snowflake-labs/snowflake"
      version = "0.84.1"
    }
  }
}
- To configure your provider, create a file, providers.tf - this is where you'll configure the providers you use. You can pass credentials in here when testing things out, but it's best to use an environment variable solution when storing this in a repository (so you don't expose your secrets anywhere!). An example might look like:
provider "snowflake" {
  role = "SYSADMIN"
}
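When testing locally, the provider can pick up the account and credentials from environment variables (the same variables the CI examples below rely on), so nothing sensitive needs to be written into providers.tf. A minimal sketch, with placeholder values:

# Placeholder values - the provider reads these from the environment
export SNOWFLAKE_ACCOUNT="your-account-identifier"
export SNOWFLAKE_USER="your-username"
export SNOWFLAKE_PASSWORD="your-password"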
How to deploy Terraform for Snowflake with Github Actions
Setting up your credentials
In order to interact with Snowflake via Terraform in Github Actions, it'll need some credentials to log in to Snowflake. There are a few options here: username/password, private keys, OAuth, etc., but for this example we'll stick with username/password; for other authentication options, please see the provider documentation.
To store your credentials at repository level in Github Secrets: Your Repository > Settings > Secrets and variables > Actions > New repository secret
With the following values:
- SNOWFLAKE_ACCOUNT - your account identifier, with region/provider details, e.g. select_dev-production, or select123.eu-west-1.aws for legacy accounts
- SNOWFLAKE_USER - login username
- SNOWFLAKE_PASSWORD - login password
Checks to run in CI
When making infrastructure changes in a repository containing Terraform, there are some useful steps to consider adding to your CI to help review any proposed changes:
- terraform fmt -check - checks the formatting of your Terraform configuration. Part of the beauty of Terraform is its readability, so this is a key step. The -check flag makes the command fail on incorrect formatting; without it, the command will fix the formatting for you (better in development than in CI).
- terraform validate - runs basic validation; it doesn't call out to remote services, but will check things like attribute names, etc.
- terraform plan - also useful in development, but this compares your latest Terraform configuration to the remote state of the service, i.e. it figures out what changes are about to be applied (but doesn't apply them). This step is super useful in CI, as it shows exactly what is about to change, and should be reviewed before acceptance.
Putting this all together, an example CI pipeline in Github Actions could be something like:
name: Terraform Snowflake CI

on:
  pull_request:
    branches:
      - main

env:
  SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
  SNOWFLAKE_USER: ${{ secrets.SNOWFLAKE_USER }}
  SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}

jobs:
  RunChecks:
    name: Run Checks
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Get Terraform version
        id: set-terraform-version
        run: echo "TERRAFORM_VERSION=$(cat .terraform-version)" >> "$GITHUB_ENV"
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TERRAFORM_VERSION }}
      - name: Terraform init
        id: terraform-init
        run: terraform init
      - name: Terraform format
        id: terraform-fmt
        run: terraform fmt -check
      - name: Terraform validate
        id: terraform-validate
        run: terraform validate
      - name: Terraform plan
        id: terraform-plan
        run: terraform plan
Copy and paste the above into a file such as .github/workflows/ci.yml, and the next time you make a Pull Request it'll run checks against your branch, which will look as such in Github Actions:
Example CI pipeline on a Pull Request in Github Actions
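One assumption in the workflow above: the Terraform version is read from a .terraform-version file in the repository root (a convention also used by tools like tfenv), so make sure one is committed alongside your configuration, e.g.:

# Example only - pin whichever Terraform version you're using
echo "1.7.5" > .terraform-version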
(Optional Extra) - If you are able to facilitate having many complete replicas of your environment, you could also add the terraform apply step to your CI. The benefits of this would be catching any errors that are only raised by Snowflake (rather than by the Terraform provider) when the changes are made, as well as allowing for manual validation in Snowflake, and bringing the development flow closer to that of some common engineering ones.
The downside to this is that it would likely require a separate Snowflake account, or at the very least a separate set of credentials to split your CI environment from your Deployment environment, which is beyond the scope of this article.
But, for completeness, adding this to ci.yml would simply require an extra, final block as such:
      - name: Terraform apply
        id: terraform-apply
        run: terraform apply -auto-approve
Deploying your changes
Assuming all is well from the above, and everyone is happy with the changes, we can deploy them via a deployment pipeline:
terraform apply -auto-approve - applies the changes to your infrastructure. When running via Actions, -auto-approve allows Terraform to continue without user intervention, as it would otherwise expect a prompt when run on the command line.
So a very simple deployment step! It would look as such in Github Actions, somewhere like .github/workflows/deployment.yml:
name: Terraform Snowflake Deployment

on:
  push:
    branches:
      - main

env:
  SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
  SNOWFLAKE_USER: ${{ secrets.SNOWFLAKE_USER }}
  SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}

jobs:
  RunChecks:
    name: Run Checks
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Get Terraform version
        id: set-terraform-version
        run: echo "TERRAFORM_VERSION=$(cat .terraform-version)" >> "$GITHUB_ENV"
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TERRAFORM_VERSION }}
      - name: Terraform init
        id: terraform-init
        run: terraform init
      - name: Terraform apply
        id: terraform-apply
        run: terraform apply -auto-approve
Storing the Terraform State for Github Actions
In order to retain a memory of how your infrastructure looks, Terraform creates a state file containing a list of all your resources and their configuration as they currently are (or should be!).
Now, when making changes via Pull Requests and Github, the Github Action will need access to the current state of your resources, which will be the resulting infrastructure from the previous terraform apply step. So, how do we ensure it has access to that from a previous action?
Enter Terraform Backends (for those not using Terraform Cloud): these allow you to configure your Terraform setup to store its state file somewhere more permanent, rather than in your local storage.
To begin this setup, you'll need to add a backend block to your Terraform project, i.e. a backend.tf file. Below is an example of this block for an Amazon S3 backend (without a DynamoDB setup for now).
terraform {
  backend "s3" {
    bucket = "terraform"
    key    = "state/snowflake.tfstate"
    region = "eu-west-1"
  }
}
Simple enough, but then comes the slightly trickier bit: incorporating this into your deployment pipeline. Thankfully, once authenticated to AWS, Terraform will take care of updating the remote state for us!
In order to do this, add three more Github Secrets for your AWS account:
- AWS_ACCESS_KEY_ID - access key identifier for your AWS account
- AWS_SECRET_ACCESS_KEY - access key secret
- AWS_REGION - AWS region, e.g. eu-west-1
Then update your ci.yml/deployment.yml to contain the following step (or similar authentication) before running any Terraform commands.
      - name: Authenticate against production account
        uses: "aws-actions/configure-aws-credentials@v2"
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}
All in all, a rough pipeline for the Terraform Snowflake Github Actions workflow would look something like this:
Snowflake Terraform Examples
If you're looking for places to start, some basic examples of the core Snowflake concepts will look a little something like this. First up, let's create a warehouse (X-Small, obviously 😉).
resource "snowflake_warehouse" "reporting_warehouse" {
  name           = "reporting_wh"
  comment        = "Warehouse for reporting and BI tools"
  warehouse_size = "x-small"
}
Mega, we have a way to query our Snowflake data, but so far nowhere to store it, so let's add a good old-fashioned database.
resource "snowflake_database" "reporting" {
  comment                      = "Reporting database"
  data_retention_time_in_days = 30
  name                         = "REPORTING"
}
Beautiful stuff, and finally let’s set ourselves up with a Snowflake role, and grant it to our user for reporting.
resource "snowflake_role" "reporter" {
  name    = "REPORTER"
  comment = "This role is limited to read only querying on the reporting database"
}

resource "snowflake_role_grants" "reporter" {
  role_name = snowflake_role.reporter.name
  roles     = []
  users = [
    snowflake_user.reporting_user.name
  ]
}
The final example demonstrates how Terraform creates references between objects, with the snowflake_role_grants resource depending on the snowflake_role and snowflake_user resources; this is how Terraform determines the order in which to execute the SQL.
Further Examples
For a working example of this article, check out the Github repository below, forking it and creating a Codespace is a quick way to get started!
GtheSheep/terraform-snowflake-example
For a hands-on setup of Terraform for Snowflake, check out Snowflake’s own guide on getting started: Terraforming Snowflake.
Challenges with Snowflake and Terraform
As with any tooling, Terraform is not without its pitfalls, and sometimes the behaviours or use cases of Snowflake can even exacerbate that, so let's check a few off:
Full Resource Coverage
Although there's a super bunch maintaining the Snowflake Terraform provider, sometimes it can take a little while to bring it up to speed with the latest and greatest Snowflake functionality, or sometimes it's a design choice to explicitly not allow something to be done via Terraform (a rarer situation). This can mean there's still a requirement to do some management in Snowflake itself.
On top of this, some resources aren’t completely fleshed out with all the bells and whistles that Snowflake make available, which can result in a need to run some “bonus” SQL commands in Snowflake after the resource has been deployed.
As an example, the equivalent of GRANT SELECT ON ALL TABLES IN SCHEMA <schema> TO ROLE <role> was not directly available in the provider; for years it required either finding a way to iterate over the objects within the schema, or running extra SQL. Thankfully, the on_all parameter now exists!
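As a rough sketch of what that can look like now (grant resource and argument names have moved around between provider versions, so treat this as illustrative and check the docs for the version you've pinned):

# Illustrative only - grants SELECT on all existing tables in a schema to the reporter role
resource "snowflake_table_grant" "reporter_select_all" {
  database_name = snowflake_database.reporting.name
  schema_name   = "PUBLIC"
  privilege     = "SELECT"
  roles         = [snowflake_role.reporter.name]
  on_all        = true
}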
Ownership and Lifespan
When deciding which Snowflake resources to manage via Terraform, consider their lifespan and ownership.
Lifespan could refer to just whipping things up and down for testing, POCs, etc., and whether that's worth the effort, but it also goes further: think about transient tables, or ones that are frequently dropped and recreated independently of when Terraform might deploy things (dbt, I'm looking at you 👀). This kind of activity can cause Snowflake to end up very out of sync with the Terraform state, and you'll end up with your deployments in a pickle 🥒.
As for ownership, decide which process truly owns these objects: could Terraform be used to ring-fence the environment for those processes to operate in? (Terraform a database in which dbt can manage the schema 🤷♂️).
As an example, let’s assume we have a raw data loading tool, Airbyte, and a transformation tool, dbt:
- Airbyte would generally take a target schema to load data into, after that, we allow it to own and manage the tables within that Snowflake schema, adding columns and tables as it needs (assume we default to selecting all resources available from the source). In this case, we wouldn’t want to manage the tables with Terraform, as it’ll quickly be outdated by Airbyte’s activity, but we could manage the schema with Terraform, both the creation, and the access control, as that is immutable from Airbyte’s point of view.
- dbt could then transform the data to multiple Snowflake schemas, creating them, and building tables in each. Here, it wouldn’t make sense to even manage the schemas with Terraform, as dbt has full ownership of them, and could even handle the access control for you, but the database is the area in which dbt is allowed to operate, so we could provision that via Terraform.
Both of these processes run entirely independently of your Terraform pipeline, so the state would either not know of the changes made by these other processes and create a lot of noise in your terraform plan, or worse, overwrite the changes made by Airbyte/dbt!
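To make the split concrete, a minimal sketch of the Airbyte side (names are hypothetical): Terraform owns the database and the landing schema, while Airbyte owns whatever tables it creates inside that schema.

resource "snowflake_database" "raw" {
  name = "RAW"
}

resource "snowflake_schema" "airbyte_landing" {
  database = snowflake_database.raw.name
  name     = "AIRBYTE_LANDING"
  comment  = "Schema owned by Terraform; tables inside are managed by Airbyte"
}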
Managing multiple repositories
When resources are required and/ or interacted with from multiple locations, it can get a little tricky.
As mentioned above, it's key to nail down where the ownership of Snowflake resources sits, as conflicting definitions can cause different deployments to overwrite the configurations of others, e.g. if two repos grant USAGE on a database to different roles (somewhat mitigated via the enable_multiple_grants parameter these days).
Dependencies can be managed via Terraform modules, reading things from data sources, and so on, but to help stop this flexibility sprawling out of control, develop a solid plan for how you're going to structure the resource management, e.g. a central repo managing Snowflake roles/users/grants/warehouses and all the really core elements of an account, while allowing other teams to provision their own databases/tasks/procedures as required.
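As one hedged example of what that split might look like in a "consumer" repo (the data source name here is taken from the provider docs, so verify it exists in the version you pin): read the centrally-managed database via a data source rather than redefining it, and only create the resources that repo actually owns.

# Illustrative sketch - the central repo owns the REPORTING database,
# this repo only provisions a schema inside it
data "snowflake_database" "reporting" {
  name = "REPORTING"
}

resource "snowflake_schema" "team_sandbox" {
  database = data.snowflake_database.reporting.name
  name     = "TEAM_SANDBOX"
}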