Data Loading in Snowflake: A Comparison of 5 Different Options
- Tomáš Sobotík, Lead Data Engineer at Tietoevry
Knowing how to efficiently load data into your Snowflake account is crucial. Snowflake provides a variety of different data ingestion options, each with their own pros, cons and important considerations.
Customers have several different options for both batch and continuous/real time loading, all of which scale to support loading large datasets. Selecting the right option can have an impact on cost, performance, reliability and ease of maintenance. Each of these factors needs to be taken into consideration.
In this blog post, I'll do a deep dive into the five different data loading options and highlight the different tradeoffs like latency, cost, maintenance and required technical expertise. You'll also learn some best practices for each method. At the end, I'll summarize the different options and provide some general guidelines to help you choose between them.
Batch loading
Batch loading is the most common data ingestion approach used in all data warehouses. With batch loading, data is bulk loaded into the data warehouse on a fixed schedule (i.e. once per day, once per hour, every 4 hours, etc.). Batch loading is usually the simplest option, but comes with a big tradeoff: latency. If your data loads once per day, then it can be up to 24 hours stale before the next load happens.
Batch ingestion typically works by taking files from cloud storage (an Amazon S3 bucket, Microsoft Azure Blob Storage or Google Cloud Storage) and having a process or orchestration tool automatically load that data into Snowflake on a schedule.
In Snowflake, there are two main approaches to batch data ingestion.
Batch loading with your own virtual warehouse
In my experience, the most common way to load data into Snowflake is by running the COPY INTO command on one of your own managed virtual warehouses.
This command takes files previously uploaded into a Snowflake internal or external stage and ingests them into the target table. The files could be CSV files, JSON data, or any other supported file format.
The COPY command runs on one of your own managed virtual warehouses, meaning you are responsible for creating and properly sizing the warehouse (hint: always start with an X-SMALL warehouse and size up if required to meet your SLA).
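To make this concrete, here is a minimal sketch of what this could look like, assuming a hypothetical load_wh warehouse, an orders_stage stage and a raw.orders target table (none of which come from a real environment):

```sql
-- A minimal sketch: batch load CSV files from a stage using your own X-SMALL warehouse.
-- Warehouse, stage, table and path names are placeholders.
CREATE WAREHOUSE IF NOT EXISTS load_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60          -- suspend quickly so you don't pay for idle time
  AUTO_RESUME = TRUE;

USE WAREHOUSE load_wh;

COPY INTO raw.orders
  FROM @raw.orders_stage/2024/
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
  ON_ERROR = 'ABORT_STATEMENT';
```

Your orchestration tool would simply run the COPY statement above on whatever schedule your SLA requires.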
Cost & efficiency considerations
When loading data using the COPY INTO command in Snowflake, you pay for each second the virtual warehouse is active. Snowflake virtual warehouses are billed for a minimum of 60 seconds each time they resume, so ideally your data loading process should take close to (or more than) 60 seconds each time it runs; otherwise you're paying for compute you don't use.
The smallest virtual warehouse is able to ingest 8 files in parallel, and this number doubles with every warehouse size. For example, to fully utilize a MEDIUM warehouse you would need 32 files. If one of the files is significantly larger than the others, the remaining threads can sit idle while a single thread does all the work.
Batch loading data on your own virtual warehouse can be one of the most cost effective ways to load data into Snowflake, provided that you have enough files & data to load to fully saturate the warehouse, and don't end up paying for idle compute costs. Be sure to check out our other blog post on batch data loading best practices.
Required skills
Using the COPY command does not require any additional skills on top of basic SQL knowledge.
With this option, users will need to have some scheduling/orchestration tool that runs the COPY command on the desired schedule.
Batch loading with serverless tasks
Serverless tasks allow Snowflake users to run SQL commands on a defined schedule using Snowflake-managed compute resources instead of their own managed virtual warehouse. You can set up a serverless task to run the COPY INTO command at your desired frequency.
Similar to running the COPY INTO command on your own managed warehouse, the latency or staleness of your data will be tied to how often you run the COPY command to load new data.
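As a rough sketch (reusing the hypothetical raw.orders table and orders_stage stage from above), a serverless task only needs a schedule and the COPY statement; leaving out the WAREHOUSE parameter is what makes the task serverless:

```sql
-- A minimal sketch of a serverless task that batch loads once per hour.
-- No WAREHOUSE is specified, so Snowflake-managed (serverless) compute is used.
CREATE TASK raw.load_orders_hourly
  SCHEDULE = '60 MINUTE'
  USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'
AS
  COPY INTO raw.orders
    FROM @raw.orders_stage
    FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- Tasks are created in a suspended state and must be resumed to start running.
ALTER TASK raw.load_orders_hourly RESUME;
```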
Cost & efficiency considerations
Because you only pay per second of compute used, serverless tasks can help with the underutilized compute problem that comes with batch loading on your own virtual warehouse. If your data loading process only takes 10 seconds each time it runs on your own managed warehouse, then you end up paying for 50 additional seconds of compute every time.
The compute costs for serverless tasks are charged at 1.5x the rate of an equivalently sized virtual warehouse managed by you. From a cost efficiency standpoint, they should therefore only be used if you cannot keep a virtual warehouse busy for at least 40 seconds of the minimum 60-second billing period (40 seconds at 1.5x the rate costs the same as the 60 seconds you would be billed for on your own warehouse).
Required skills
Loading data into Snowflake with a serverless task is arguably the easiest data ingestion option out there. It only requires a working knowledge of SQL, and does not require any third party orchestration tools.
Continuous loading with Snowpipe
Unlike batch processing, which runs on a fixed schedule, continuous loading processes data based on some event, usually a new file landing in cloud storage. The source of these files could be anything from a change data capture service logging records from a relational database to web application event data. Each new file triggers an event notification, which automatically kicks off the loading process. Cloud-based messaging services like AWS SNS or SQS are often used to deliver these notifications; they can be directly integrated with Snowflake and inform the platform about newly landed files.
With continuous loading, data latency will be much lower as new data is constantly being loaded as soon as it becomes available.
To help streamline data loading and enable continuous processing, Snowflake offers a powerful feature called Snowpipe. Snowpipe is integrated with an event notification service on the cloud provider side (i.e. AWS SNS/SQS). Each Snowpipe object also has an associated COPY command. The event notifications inform a Snowpipe object about newly landed data files in an external stage, which triggers the associated COPY command to run.
Most customers use the "auto-ingest" Snowpipe feature, meaning files get automatically loaded as soon as they arrive. But, it's worth noting there is also a REST API for Snowpipe which allows you to choose when a Snowpipe object is triggered.
Snowpipe is a serverless feature, meaning you do not have to worry about which virtual warehouse to use or whether it is sized correctly. Instead, the compute is entirely maintained and managed by Snowflake. In terms of latency, Snowpipe will typically ingest files within minutes of them landing in cloud storage.
Cost & efficiency considerations
The biggest impact on Snowpipe cost efficiency is the file size. Snowpipe charges an overhead fee of 0.06 credits / 1000 files processed. As a result, the cost of loading 100GB can vary significantly depending on the file size, since it will impact the number of files. Snowpipe compute resources are charged at 1.25 credits per compute-hour.
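To illustrate with the overhead fee above (and the 100-250MB file size guidance discussed next): loading 100GB delivered as 10MB files means roughly 10,000 files and about 0.6 credits of per-file overhead, whereas the same 100GB delivered as 250MB files is roughly 400 files and only about 0.024 credits of overhead, on top of the compute charges in both cases.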
Snowflake recommends aiming for your files to be 100-250MB compressed. If your upstream applications frequently send data in small files, you will want to consider implementing a process that aggregates the files into larger batches. A common service used for this purpose is Amazon's Kinesis Firehose. Alternatively, users can consider other real time data processing options like Snowpipe Streaming (discussed below).
Required skills
To set up a Snowpipe loading job, you'll need to be able to access your cloud provider in order to create the required event notification service.
Once the upstream cloud storage and event notification services are configured, setting up the Snowpipe object can be done entirely in SQL.
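As a rough sketch (again using hypothetical object names, and assuming the external stage and cloud event notifications are already configured), the Snowpipe object itself is just a COPY command wrapped in a pipe:

```sql
-- A minimal sketch of an auto-ingest Snowpipe over an existing external stage.
-- Pipe, stage and table names are placeholders.
CREATE PIPE raw.orders_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw.orders
    FROM @raw.orders_ext_stage
    FILE_FORMAT = (TYPE = 'JSON');

-- The notification channel your cloud storage events must point at
-- (e.g. an SQS ARN on AWS) is shown in the pipe's metadata:
SHOW PIPES LIKE 'orders_pipe' IN SCHEMA raw;
```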
Real time loading with Kafka
There are many use cases across various industries which require real time data: credit scoring, fraud analysis, or even user facing analytics. Delivering data with low latency is typically accomplished by using a message broker like Apache Kafka. Instead of files, Kafka receives messages from a variety of different "data producers", and then sends those messages out to different "data consumers". In the context of data loading, Snowflake becomes a data consumer.
Using Kafka with Snowflake to achieve low latency data loading can be done in two different ways:
- "Snowpipe mode", which combines Kafka with the traditional Snowpipe methods we discussed above.
- "Snowpipe Streaming" mode, a new offering released by Snowflake in 2023.
Kafka Connector - Snowpipe mode
Snowpipe mode for the Kafka connector uses a combination of micro-batched files and Snowpipe: Kafka messages are flushed into temporary files, which are then ingested via Snowpipe.
With the Kafka connector, you can configure how frequently new files are created. As discussed earlier, when using regular Snowpipe you want to aim for files that are 100-250MB. Maintaining this optimal file size can become challenging given the other factors that come into play:
- how frequently your source generates data and how fast you need to have its data loaded into Snowflake
- the flush rate: how frequently you flush the data into the files, configurable in the Kafka connector options
- the partition count in your Kafka cluster
Cost considerations
With this option, you are only charged for the Snowpipe ingestion, with the same charges and nuances as discussed above.
Required Skills
Using this data loading option requires a working Kafka environment and the corresponding knowledge required to create and manage that infrastructure, which can be a significant barrier for many teams.
Snowpipe Streaming
Snowpipe Streaming is the latest addition to Snowflake's data streaming features. This option offers even better latency (seconds) than Snowpipe mode (discussed above).
With Snowpipe Streaming, there are no stages, no files and no Snowpipe objects (which I recognize is confusing, since the name includes "Snowpipe"). Instead of using files, data is loaded row by row.
Cost considerations
Just like the regular Snowpipe, Snowpipe Streaming uses a serverless compute model. Snowflake manages compute resources automatically based on the streaming load. To estimate the charges, it's best to experiment with a typical streaming workload.
Snowpipe Streaming compute is priced lower than regular Snowpipe, at 1 credit per compute-hour (versus 1.25 for Snowpipe). There are no charges for cloud services, but Snowflake does charge an hourly rate of 0.01 credits per streaming client, so it's important to be conscious of the number of clients you create. If a user creates 100 clients, this management overhead alone would cost roughly $22,000 per year, assuming a credit price of $2.5 (100 clients × 0.01 credits/hour × 8,760 hours/year × $2.5/credit ≈ $21,900).
According to Snowflake and other sources, this method is the most cost effective way of loading data into Snowflake.
Required skills
Snowpipe Streaming is the most attractive data loading option in terms of offering the best cost efficiency and lowest data latency. But it also comes with the highest skill barrier. The Snowpipe Streaming API is part of the Java SDK, which means you must be familiar with Java in order to use the feature.
How to choose a data loading option
Now that we've covered all the data loading options, you are likely wondering how to choose the right one. Ultimately, you will need to take into consideration all the different aspects we discussed:
- What latency does your use case require?
- Are you looking for the most cost efficient option?
- What technical expertise does your team have and how will the loading option fit into your existing stack?
- What system is producing the data and files you need to load, and do you have control over that?
Let's summarize everything in a table to highlight the different considerations with each option:
| Method | Option | Lowest Possible Latency | Required Skills |
|---|---|---|---|
| BATCH | COPY command | minutes | SQL |
| BATCH | Serverless task | minutes | SQL |
| CONTINUOUS | Snowpipe | minutes | SQL + cloud |
| REAL TIME | Kafka Snowpipe mode | minutes | SQL + cloud + Kafka |
| REAL TIME | Kafka Snowpipe Streaming | seconds | SQL + cloud + Java |
If you'd prefer a visual decision tree, you can consider the diagram below. Note, this doesn't include every possible edge case or consideration you should make. For example, some users can still use Snowpipe in a very cost effective way even with files that are less than 100MB, provided there aren't too many of them.
This flow chart mainly considers latency and file size as the two key factors. Looking at the left part of the tree: even if you are okay with latency in minutes, if your team is comfortable using the Java SDK, Snowpipe Streaming will be the better option, as it delivers data faster and at a lower cost.
Additional resources
If you are looking to learn more, Snowflake has very comprehensive documentation on Snowpipe Streaming and the other data loading methods. I also highly recommend watching this Snowflake presentation which does a detailed benchmarking of the different data loading options using real data.