Snowflake Batch Data Loading | Best Practices in 2024

Date: Monday, January 15, 2024

Tomáš Sobotík
Senior Data Engineer & Snowflake SME at Norlys
Ian Whitestone
Co-founder & CEO of SELECT

This is the second part of our series related to data loading in Snowflake. In the first post, we covered the five different options for data loading. In this post, we are going to do a deep dive into the most common data ingestion technique: batch data loading. We'll cover:

How to partition & size your files prior to loading
How to set up your Stage
How to use the COPY INTO command
Virtual Warehouse size considerations
When to use Serverless instead of self managed warehouses

File Preparation Overview

Whether you are batch loading data or loading it in realtime, you must prepare a set of files to load. First, you must choose how your files will be partitioned (organized) in cloud storage. This plays an important factor in how much time Snowflake has to spend scanning cloud storage to list all the available files. Next, you should think about how you can batch your files into the optimal file sizes to ensure the virtual warehouses you use for loading can be fully utilized. Lastly, you need to choose a file format and configure your Stage object.

File Partitioning

Rather than listing all files in a single directory, you should try and organize them along logical paths or dimensions such as the date/time.

For example, if you are storing files in S3, you could have them organized by date:

s3://my-data-bucket/2023/12/01/dataFile1.csv
s3://my-data-bucket/2023/12/01/dataFile2.csv
s3://my-data-bucket/2023/12/02/dataFile3.csv
s3://my-data-bucket/2023/12/02/dataFile4.csv
s3://my-data-bucket/2023/12/03/dataFile5.csv

You could even go a step further and add additional dimensions to group them by. For example, if we cared about organizing our data by department, we could store the files like:

s3://my-data-bucket/finance/2023/12/01/dataFile1.csv
s3://my-data-bucket/finance/2023/12/01/dataFile2.csv
s3://my-data-bucket/marketing/2023/12/02/dataFile3.csv
s3://my-data-bucket/marketing/2023/12/02/dataFile4.csv

The important takeaway here is that you should organize your files based on how you plan to load them into Snowflake. The goal is to reduce the amount of files that need to be scanned by Snowflake. Even though Snowflake automatically keeps track of what files have already been loaded, having to list all the files inside stage can become very time consuming.

Snowflake file partitioning in cloud storage

When you organize files in these directories, it only needs to scan files inside those directories instead of the entire stage. Once you have this organizational structure in place, you need to use the path filters and combine it with PATTERN keyword.

Consider the following two examples.

This example scans all the files in given stage even though the PATTERN keyword is used. That’s because the PATTERN keyword is applied after all files in stage are scanned.

COPY INTO raw_table
FROM @my_stage
PATTERN='.*[.]csv'

To ensure Snowflake skips the expensive scanning of un-needed files, you must specific a path filter. This second example will only scan files in the /orders/2023_12 directory, and then find all CSVs in that path.

COPY INTO raw_table
FROM @my_stage/orders/2013_12
PATTERN='.*[.]csv'

Optimal File Sizes for Loading

The biggest impact on data loading efficiency will be the number and size of the files you are loading.

Snowflake recommends you aim for file sizes between 100 - 250 MB compressed. If you have many large files that are well above 250MB, you should try to split them before loading. Similarly, if you have too many small files (<10MB), you should try and combine them before loading.

The motivation for this is to achieve optimal utilization of your virtual warehouse by loading files in parallel and saturating all the available cores/CPU processing threads.

Consider the image below. It makes a huge difference if you load a single 200 GB file and utilize only 2% of the XL warehouse (which can process up to 128 files in parallel) or if you split the file into many more smaller files and fully saturate it!

Snowflake file sizes for optimize data loading example 1

Each warehouse size can process a different number of files in parallel. As you increase the warehouse size, you double the number of nodes. Each node has 8 threads and each thread can process 1 file. This means that the smallest warehouse (XS) with a single node and 8 threads can process up to 8 files in parallel!

Warehouse size	# Threads / Files to process
XS	8
S	16
M	32
L	64
XL	128

To further demonstrate these effects, consider the image below. If we have one 1GB file, we will only saturate 1/16 threads on a Small warehouse.

Snowflake file sizes for optimize data loading example 2

If you instead split this file into ten files that are 100 MB each, you will utilize 10 threads out of 16. This level of parallelization is much better as it leads to better utilization of the given compute resources. It's worth noting that an XSMALL would be a better choice here.

Snowflake file sizes for optimize data loading example 3

Setting up a Stage

When it comes to batch data loading, there are three additional Snowflake resources you will have to work with:

The FILE FORMAT object, which defines the format of the files
The STORAGE INTEGRATION object , which encapsulates the access information for cloud storage locations.
The STAGE object, which describes where the files are stored and how Snowflake should access that storage location

Let's dive into each.

File Format

In Snowflake there are multiple ways in which you can define the File Format:

As part of the COPY command
As part of the STAGE definition
As a standalone object

As a best practice, it is recommended to define the File Format as a standalone object, allowing it to be easily reused across your data loading operations. In addition to ease of reuse, maintainability becomes much easier. If you need to modify something with your file format (i.e. change the delimiter, skip header, etc.), you can do it centrally on the object. You do not need to go into each of your pipelines and modify all of them. A final benefit is you can control access to the formats using Snowflake's RBAC model.

Example File Format Creation

Consider a CSV file like this:

The corresponding File Format object would be defined like this:

CREATE OR REPLACE FILE FORMAT my_csv
TYPE = csv
FIELD_DELIMITER = ‘,’
SKIP_HEADER = 1
NULL_IF = ''

You can head to the Snowflake documentation for more details

Snowflake File Format Options

Snowflake offers support for many different file formats:

CSV
JSON
AVRO
ORC
Parquet
XML

Check out their documentation for the options with each file format.

Storage Integration

The Storage Integration object acts as security layer for a Stage. It allows you to create Stages where the credentials are neither shared nor stored in Snowflake. They are also reusable, meaning you can define them just once and then use them with multiple Stages. With a Storage Integration object in place, developers don’t need to know about which roles or credentials are required to access the files, they can just point their Stage to the right Storage Integration object. This is in line with the best practice of segregation of duties.

Here is an example of how to create a Storage Integration object for Amazon S3:

CREATE STORAGE INTEGRATION s3_int
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = 'S3'
ENABLED = TRUE
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::00123456789:role/myrole
STORAGE_ALLOWED_LOCATIONS = ('s3://bucket1', 's3://bucket2')
STORAGE_BLOCKED_LOCATIONS = ('s3://bucket3/sensitive_data/')

You'll notice multiple things:

The object requires the AWS ARN of the IAM role. This means you'll likely have to work with your AWS cloud admins in order to create an access role for Snowflake to use, unless you have that access already.
You can define multiple buckets or locations for a single storage integration. This is helpful, as it means you can reuse same Storage Integration for multiple S3 buckets or data pipelines.

To learn more about Storage Integrations, head to the Snowflake docs.

Stage

Now that we've created the File Format and Storage Integration objects, we are ready to create our Stage.

A Stage is file location description. It defines where your data files for ingestion are stored. There are two types of stages:

Internal stages that are part of your Snowflake account
External stages, which are typically pointing to object storage locations (like an AWS S3 bucket) in your cloud provider

In terms of best practice, we recommend to use external stages where possible for two reasons:

Most Snowflake customers already have their data files loaded into Cloud Storage, so it makes sense to define an external Stage that points to those files.
By keeping a copy of your raw data outside of Snowflake, you can easily access or use those files with other systems

If you have followed the best practice of defining separate File Format and Storage Integration objects, creating a Stage becomes quite simple. Here's an example command to create an external Stage pointing to a location in AWS S3:

CREATE STAGE my_s3_stage
STORAGE_INTEGRATION =s3_int
URL = 's3://bucket1/path1/'
FILE_FORMAT= my_csv

As mentioned earlier, you can alternatively skip creating the separate File Format and Storage Integration objects and instead define all the relevant properties directly in the Stage.

How to use the COPY INTO command

Once you've set up the necessary FILE FORMAT, STAGE and STORAGE INTEGRATION objects, you're now ready to load data into Snowflake from your cloud storage provider. To accomplish this, you can leverage the COPY command.

The COPY command is Snowflake feature used for both data loading and unloading, with different parameters for each direction of data movement. Let’s have a look at how you can use COPY command for data imports.

Basic Usage

In its simplest form, the COPY could look something like this:

COPY INTO mytable
FROM @my_s3_stage

This simplicity arises from the fact that we have defined the file format as part of the stage object. As a result, no other parameters or settings need to be specified. You can refer to the COPY command documentation for more details.

Performing Simple Data Transformations

The COPY command also supports simple data transformations using the SELECT statement. Operations like omitting the columns, re-ordering them, or casting to a different data type are all possible. You can also FLATTEN semi-structured data into individual columns or add a CURRENT_TIMESTAMP as an audit column.

Individual columns in the SELECT statement are then referenced through the dollar sign together with the column position: $1, $2, $3, etc.

Here's an example where we select only first, second, sixth and seventh columns from the staged file:

COPY INTO home_sales(city, zip, sale_date, price)
FROM (
    SELECT t.$1, t.$2, t.$6, t.$7 
    FROM @mystage/sales.csv.gz t
)

The following operations are not supported in the COPY statement:

WHERE
ORDER BY
LIMIT
FETCH
TOP
JOIN
GROUP BY

Copy Command Metadata

The COPY command generates metadata which can be retrieved from different table functions/views:

COPY_HISTORY - a table function in the snowflake.information_schema schema
LOAD_HISTORY - a table function in the snowflake.information_schema schema
LOAD_HISTORY - a view in the snowflake.account_usage schema

This metadata contains information about each loaded file, e.g. file name, row counts, error counts, target table, etc. This metadata then prevents you from loading the same file into the Snowflake more than once. The load metadata expires after 64 days. If you need to reload the same file multiple times, there is COPY command option FORCE = TRUE which allows you to reload the same file.

Virtual Warehouse Size Considerations for Data Loading

The COPY command requires an active virtual warehouse to run and ingest the files. With Snowflake virtual warehouses, you pay for each second the warehouse runs, with a minimum billing period of 60 seconds each time the warehouse resumes.

When it comes to choosing the right warehouse size for data loading specifically, there are two things to keep in mind:

As we discussed earlier, you want to batch your files so they are around 100MB - 250MB (compressed). This is important to ensure you have enough files to effectively use all the available processing threads on the warehouse (as opposed to having 1 big file which can only be processed on a single thread). On the other hand, having too many small files can result in unneccessary overhead costs.
Always use the smallest possible warehouse size that will satisfy your SLAs. See the case study below for a real world example of this in practice.

Using a larger warehouse size will often result in unused cores/threads because you will not have enough files. What will also happen is you will process all the files in a few seconds, but you still get billed for the minimum 1 minute biller period. As a result of this, we recommend you start with smallest warehouse size possible, and then only increase the warehouse size if the data isn't being loaded quick enough for your SLAs.

Hypothetical Warehouse Sizing for Data Loading

To iron out these concepts, let's look at a hypothetical example.

Imagine you need to ingest 1 GB of data on daily basis. You already know that loading one big file isn’t effective, so the files are produced in 100 MB chunks (10 files). We'll assume a Snowflake credit price of $3/credit.

Warehouse size	Number of cores	Utilization ratio	Run time	Billed time	Cost/day	Cost/year
XS	32	~30%	10s	60s	$0.2	$73
M	8	~100%	50s	60s	$0.05	$18.25

With the Medium warehouse, we can only achieve ~30% utilization at best, because there are only 10 files to process and 32 available cores for processing. Compare that with the X-Small warehouse, its 8 cores can be fully saturated. Even though the Medium warehouse processes the files 5 times faster than the X-Small warehouse, it ends up costing 4 times as much due to the 50s of idle time that you must also pay for.

Real World Example of Warehouse Sizing for Data Loading

To see an example of how this played out in the real world, check out the results from an actual Snowflake account.

Originally, this data loading pipeline used a Large warehouse. It was loading all the files in about 5 seconds and then sitting idle for another 55 seconds. After the warehouse was dropped to an X-Small size, the loading time increased to ~20s on average.

Snowflake warehouse data loading cost optimization part 1

More importantly, the costs dropped off the cliff, from ~$130/day to less than $5/day, representing significant cost savings of ~$45,000 / year. For this customer, having the data load in ~20s instead of ~5s was acceptable for their SLAs, so the tradeoff was well worth it.

Snowflake warehouse data loading cost optimization part 2

When to use serverless compute for data loading?

In our previous post on the 5 different data loading options in Snowflake, we shared some tips on when to consider using serverless tasks versus a self-managed warehouse.

We encourage you to read that post for more details, but if you want a quick summary, you should consider using Serverless Tasks if your data loading jobs take less than 40 seconds for the given warehouse size. Why 40s? Because Serverless Tasks compute is charged at 1.5X the rate of the compute for self managed warehouses. Here's an example table to break it down (assuming $3/credit):

Warehouse size	Compute Type	Run time	Cost
XS	Serverless	10s	$0.0125
XS	Self Managed Warehouse	10s	$0.05
XS	Serverless	35s	$0.0375
XS	Self Managed Warehouse	35s	$0.05
XS	Serverless	40s	$0.05
XS	Self Managed Warehouse	40s	$0.05
XS	Serverless	50s	$0.0625
XS	Self Managed Warehouse	50s	$0.05

If you're curious about the cost calculations, here's how it is done for the first row:

An X-Small warehouse using 1 credit per hour
For serverless tasks, we need to multiply that by 1.5
Serverless tasks are billed per second, so the cost becomes $0.00125/second (1 credit/hour * 1.5 * $3/credit / 3600 seconds / hour)

And for the second row:

An X-Small warehouse is billed for a minimum of 60 seconds
The cost for anything less than 60 seconds is: $0.05 (1 credit / hour * $3/credit / 60 minutes / hour)