Databricks vs. Snowflake: An Honest Comparison in 2024
- Ian Whitestone, Co-founder & CEO of SELECT
Databricks & Snowflake are two of the most popular data cloud platforms in the market right now.
They started out solving very different use cases: Snowflake as the SQL data warehouse and Databricks as a managed Apache Spark service. They were even partners in the early days!
Today, they are both multi-faceted data cloud platforms solving a variety of different use cases. And, as a result, direct competitors.
Back on February 28, 2024, I had a conversation with Jeff Chou from Sync Computing. Jeff’s company works solely with Databricks customers, while we at SELECT work solely with Snowflake customers.
Given that split, we thought it'd be fun to get together and have a genuine conversation about each platform. Neither of us was deeply familiar with the other's platform, but we were both keen to learn from each other.
This was an honest and unscripted conversation between two practitioners. No bullshit benchmarks. No marketing fluff.
We talked about their origin stories, the most common use cases we see from actual customers, their strengths and weaknesses, and where they’re both headed.
Below, I've put together a summary of the key things we discussed.
Origin Stories
Initially, Databricks and Snowflake started as partners, each focusing on a different aspect of data management. While Snowflake specialized in data warehousing, Databricks carved its niche in managed Spark and then quickly expanded to machine learning (ML) workloads. Interestingly, they used to refer customers to each other.
Fast forward to the present, and both platforms have undergone remarkable transformations. If you look at their websites (snapshotted as of February 27, 2024), Snowflake now calls itself the "data cloud", while Databricks brands itself as the "data intelligence platform":
At the end of the day, they are both comprehensive, all-in-one data cloud platforms that serve a variety of different data use cases.
With that said, it's still quite interesting to explore their origin stories, as they help explain the relative strengths and weaknesses of each platform today.
Snowflake was founded in 2012 by data warehousing experts from Oracle and another data warehousing company called VectorWise. Snowflake came to market 10 years ago, in 2014, with their main data warehousing product, which they often referred to as the "elastic data warehouse" due to its unique architecture that allowed compute and storage to scale independently.
Shortly after Snowflake launched, Databricks launched its first product in a completely different space. Databricks was founded by the creators of Apache Spark, who were academics doing high-performance computing research at Berkeley. Their first product was a managed offering of Apache Spark, along with a notebook interface for interactively running jobs on these compute clusters.
Snowflake started expanding by introducing data-sharing functionality in 2017, followed by a marketplace where customers could buy datasets from each other in 2019.
During a similar time frame, Databricks started going deeper into the ML space by launching their managed MLFlow offering in 2019, followed by MLFlow model serving in 2020.
Company Evolution
An interesting thing to observe is how each company has responded to market demands and introduced competing sets of functionality.
Snowflake developed Snowpark, initially aimed at migrating Spark workloads but evolving into a platform for Python-based ML workloads. They have also invested heavily in adding support for Apache Iceberg so that their customers can manage & leverage their data lakes directly from Snowflake.
Meanwhile, Databricks launched features like Photon and Databricks SQL, expanding its footprint in the data warehousing arena.
This is especially evident today when you compare the interfaces for creating a virtual warehouse in Snowflake and a "SQL warehouse" in Databricks. You can see that Databricks has pretty much just copied the design and settings of Snowflake's virtual warehouses:
While not pictured on the timeline slide above, both companies announced a wave of AI/LLM-related features in late 2023 and early 2024. It's still early days, but both have made large acquisitions in the space and are investing heavily.
Advantages of Each Platform & Key Differentiators
It’s important to understand how each company started out, as it helps explain the relative strengths and weaknesses of each platform.
Due to its start in data warehousing, Snowflake has a much stronger and more fully featured SQL data warehousing product. For most companies, this will be the most important and most used feature, as most value generated from data strategies will come from a well-managed data warehouse that can serve core business intelligence use cases.
Very few companies I talk to use Databricks as their "data warehouse". Instead, they rely on Databricks for its powerful Python notebooks and strong support for data science workloads. For companies with very technical data engineers who prefer working with Apache Spark and Python, Databricks is often the preferred choice for data transformations.

One advantage of Databricks for ETL use cases is the flexibility and customization of Spark. For analytical workloads that process extremely large datasets, Spark can sometimes be preferred since you can tune more parameters and get the job to run cheaper. In my experience, this only makes sense for workloads costing tens of thousands of dollars per year or more, since the human costs of maintaining and programming these data pipelines will often exceed any savings generated on the compute side.
In terms of future roadmaps and product evolution, one of Snowflake's key differentiators is its platform focus. In late 2023, Snowflake released Snowpark Container Services, which allows customers to run containerized applications in Snowflake. Paired with their native application marketplace, it's clear that Snowflake is building for a future where customers and partners can run any type of data application directly in Snowflake.
For Databricks, it appears they are taking an approach where customers get a managed solution for every use case out of the box. Two clear examples of this are their dashboard functionality and data catalog. With Snowflake, most customers will purchase an external BI / dashboarding tool that sits on top of Snowflake. Similarly, they will also buy a separate data catalog product to manage and keep track of all their datasets. Databricks is clearly trying to eliminate the need for customers to buy separate tools. They bought a company called Redash in 2020 and have turned that into a strong, out-of-the-box dashboard offering. Similarly, they are investing heavily in their Unity Catalog which aims to replace 3rd party data catalog vendors.
Use Cases & Key Features Comparison
During the webinar, we went through the core use cases of a data cloud platform and then listed the features from both Databricks & Snowflake for each of these use cases. The main use cases discussed were:
- Data Ingestion
- Data Transformations
- Analysis & Reporting
- ML/AI
- Data Applications
- Marketplace
- Data Governance & Management
Let’s dive into each of these in more detail.
Data Ingestion
In order to interact with data, it first needs to be loaded or "exposed" to the underlying system. For Snowflake, this usually involves running a COPY INTO command to load the data into a table, which you can then query. Snowflake also offers features like Snowpipe to load data into Snowflake automatically.
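As a rough illustration (not from the webinar), here's a minimal sketch of such a load using the snowflake-connector-python library; the account, stage, and table names are hypothetical placeholders:

```python
# Minimal sketch: load staged CSV files into a Snowflake table with COPY INTO.
# All identifiers and credentials below are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",  # placeholders
    warehouse="LOAD_WH", database="RAW", schema="EVENTS",
)

conn.cursor().execute("""
    COPY INTO raw_events
    FROM @events_stage/2024/
    FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
""")
```

Snowpipe essentially automates this pattern, running the copy as new files land in the stage.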
Most Snowflake customers will also typically use a 3rd party solution like Fivetran, Stitch or Airbyte to load data from various sources (application databases, external APIs, etc.) into Snowflake.
With Databricks, most customers instead interact with the data directly in cloud storage. That said, managed Volumes are a similar concept to Snowflake tables, where Databricks manages the table for you.
With Snowflake’s investments in supporting Apache Iceberg, more customers will leave their data directly in cloud storage and interact with it there, similar to the Databricks model.
| Snowflake | Databricks |
| --- | --- |
| Traditional COPY INTO | Autoloader |
| Snowpipe | Native integrations (e.g. S3) |
| First party connectors | Volumes |
| 3rd parties (Fivetran/Stitch/Airbyte) | DBFS |
| No ingestion required if using Iceberg Tables | |
Data Transformations
Once your data is exposed to the cloud platform, you often want to transform or enrich it in some way. Both platforms provide a variety of different solutions for doing this.
With Snowflake being a SQL-based data warehouse, most customers do their data transformations in pure SQL using a combination of tasks, stored procedures, or 3rd party transformation and orchestration tools like dbt. All SQL workloads run in Snowflake’s virtual warehouses.
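As a rough sketch of the pure-SQL route, here's a task that rebuilds a summary table every hour; the warehouse, tables, and schedule are hypothetical placeholders, submitted via the snowflake-connector-python library:

```python
# Minimal sketch: a Snowflake task that refreshes a summary table hourly.
# All names and credentials are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="...")

conn.cursor().execute("""
    CREATE OR REPLACE TASK refresh_daily_orders
      WAREHOUSE = TRANSFORM_WH
      SCHEDULE = '60 MINUTE'
    AS
      CREATE OR REPLACE TABLE analytics.daily_orders AS
      SELECT DATE(created_at) AS order_date,
             COUNT(*)         AS order_count,
             SUM(amount)      AS revenue
      FROM raw.orders
      GROUP BY 1
""")
# Tasks are created in a suspended state and must be resumed to start running.
conn.cursor().execute("ALTER TASK refresh_daily_orders RESUME")
```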
On the Databricks side, most customers use Jobs, which let you submit a Spark job to a cluster of compute instances running in your cloud account. With Databricks' recent investments in their serverless SQL warehouse product, it is becoming more common to see pure SQL data transformations with tools like dbt.
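The Spark-based equivalent that a Databricks Job might run looks roughly like this; a minimal sketch with hypothetical table names:

```python
# Minimal sketch: a Spark transformation of the kind a Databricks Job runs.
# Table names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

daily_orders = (
    spark.table("raw.orders")
    .withColumn("order_date", F.to_date("created_at"))
    .groupBy("order_date")
    .agg(F.count("*").alias("order_count"), F.sum("amount").alias("revenue"))
)

# Write the result back as a managed Delta table.
daily_orders.write.format("delta").mode("overwrite").saveAsTable("analytics.daily_orders")
```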
| Snowflake | Databricks |
| --- | --- |
| Tasks for scheduling | Jobs |
| Stored Procedures | Tasks in Workflows |
| Dynamic Tables / Materialized Views | Delta Live Tables (declarative ETL) |
| 3rd parties (dbt/Airflow/Dagster/Prefect) | SQL Warehouses |
| | 3rd parties (dbt/Airflow/Dagster/Prefect) |
Analysis & Reporting
Both Databricks & Snowflake provide their customers with a number of features to do analysis and reporting. Snowflake allows you to create lightweight dashboards directly in Snowsight, or you can build custom data apps using Streamlit.
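To give a feel for the Streamlit route, here is a minimal sketch of a data app; the table name is a hypothetical placeholder, and `get_active_session` is what's available when the app runs inside Streamlit in Snowflake:

```python
# Minimal sketch of a Streamlit data app running in Streamlit in Snowflake.
# The table name is a hypothetical placeholder.
import streamlit as st
from snowflake.snowpark.context import get_active_session

session = get_active_session()  # provided when running inside Snowflake

st.title("Daily Orders")
df = session.sql(
    "SELECT order_date, revenue FROM analytics.daily_orders ORDER BY order_date"
).to_pandas()
st.line_chart(df, x="order_date", y="revenue")
```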
Databricks has a very well-built dashboarding product that some companies use in place of a 3rd party BI tool.
| Snowflake | Databricks |
| --- | --- |
| Snowsight Dashboards | Notebook plots |
| Streamlit | SQL Visualizations |
| 3rd parties (Tableau, Looker, PowerBI, etc.) | Dashboards |
| | 3rd parties (Tableau, Looker, PowerBI, etc.) |
ML/AI
As mentioned earlier, both companies are investing heavily in ML and AI capabilities. Due to its earlier focus here, Databricks has more well-developed ML features like managed MLflow and Model Serving.
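For a flavor of the MLflow workflow, here is a minimal sketch that trains a toy scikit-learn model and logs it to an experiment; the model and data are placeholders, not anything from the webinar:

```python
# Minimal sketch: log a toy model and a metric with MLflow. On Databricks,
# runs are tracked automatically in the workspace's managed MLflow.
import mlflow
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=42)
model = LogisticRegression().fit(X, y)

with mlflow.start_run():
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")  # logged models can later be served
```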
With the launch of Snowpark Container Services, I expect many Snowflake customers will quickly be able to start hosting ML models directly in Snowflake.
| Snowflake | Databricks |
| --- | --- |
| Snowpark | MLflow |
| Snowpark Container Services | Model Serving |
| Snowflake Cortex | Strong Python support |
Data Applications
Another interesting angle for comparing Snowflake and Databricks is building "data applications". The term is admittedly broad and open to interpretation, so I'll define a "data application" as a product or feature that serves live data or insights to customers outside the company; in other words, not an application used internally.
Due to its high-performance SQL data warehouse, many companies (like SELECT) build their applications directly on top of Snowflake and serve application queries straight from Snowflake virtual warehouses. You can see more examples of this in Snowflake’s Powered By program. With new features like Container Services, it will be possible to host full web applications directly in Snowflake.
For Databricks, the main use case for "external data applications" today is their model serving features, although similar SQL query serving should soon become possible given the investments they are making in their data warehousing products.
| Snowflake | Databricks |
| --- | --- |
| Serving apps from Snowflake | Model serving |
| Unistore (HTAP) - hybrid tables | Triggering Jobs on the fly |
| Data Sharing | Serverless SQL |
| Container Services | |
Marketplace
As a customer, you often want to buy additional applications or datasets to use in your data cloud platform. Snowflake is a clear winner here with a very mature marketplace filled with both datasets and native applications you can run directly in your Snowflake account.
| Snowflake | Databricks |
| --- | --- |
| Very mature marketplace | Data marketplace about 1 year old |
| Easily buy thousands of datasets | Technology partners |
| Native Apps | Much less mature, less of a priority |
| Huge focus on partners | |
Data Governance & Management
On the governance and management side of things, both platforms offer features out of the box.
Snowflake makes hundreds of metadata datasets available to all customers for free in the Snowflake account usage database. They have a very advanced cost management suite including powerful features like budgets and resource monitors. They’ve recently announced Snowflake Horizon, a new set of capabilities to help you govern your data assets and users.
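For example, a single query against the account usage views surfaces per-warehouse credit consumption; this is a minimal sketch with placeholder credentials:

```python
# Minimal sketch: credit consumption by warehouse over the last 30 days,
# straight from Snowflake's account_usage metadata. Credentials are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="...")

rows = conn.cursor().execute("""
    SELECT warehouse_name, SUM(credits_used) AS credits
    FROM snowflake.account_usage.warehouse_metering_history
    WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
    GROUP BY warehouse_name
    ORDER BY credits DESC
""")
for warehouse_name, credits in rows:
    print(warehouse_name, round(credits, 2))
```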
Databricks has a very strong data catalog offering with their Unity Catalog product, which helps customers manage and understand all the data in their environment. Databricks is much further behind on the cost management side of things, and only recently made this data accessible in system tables (their equivalent of Snowflake’s account usage views).
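As a hedged sketch of the Databricks equivalent (assuming system tables are enabled in your workspace; column names may vary by release), a notebook cell could summarize DBU consumption like this:

```python
# Sketch: DBU consumption by SKU over the last 30 days from Databricks'
# system.billing.usage table. `spark` is pre-defined in Databricks notebooks.
from pyspark.sql import functions as F

usage = spark.table("system.billing.usage")
(
    usage.filter(F.col("usage_date") >= F.date_sub(F.current_date(), 30))
    .groupBy("sku_name")
    .agg(F.sum("usage_quantity").alias("dbus"))
    .orderBy(F.desc("dbus"))
    .show()
)
```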
| Snowflake | Databricks |
| --- | --- |
| Hundreds of metadata datasets (account usage / information schema) | Unity Catalog |
| Snowflake Horizon | System tables |
| Cost Management Suite | Compute metrics |
| | No visibility into cloud costs, just Databricks' costs |
Pricing and Cost
Both Databricks and Snowflake offer usage-based pricing where you pay for what you use. To learn more about Snowflake's pricing model, you can read our post here. Databricks' pricing information can be found on their website. A very important thing to note with Databricks' pricing is that there are two sets of charges:
- The overhead/platform charges from Databricks
- The underlying cloud costs from AWS/Azure/GCP for the servers that Databricks spins up in those accounts
Like any usage-based cloud platform, costs can quickly skyrocket if not managed or monitored appropriately.
Is Databricks cheaper than Snowflake?
A common question is whether Databricks is cheaper than Snowflake, driven in part by a heavy marketing effort from Databricks, pictured below from their website:
When considering the cost of any data process or application, there are two important factors to consider:
- The platform costs. The money you pay Databricks/Snowflake/your cloud provider.
- The human costs. The money you pay your employees to build and maintain the applications and processes they create.
Databricks claims that ETL workloads can be run much cheaper than Snowflake. This claim originates from the fact that Spark jobs can be heavily tuned. There are a ton of different parameters engineers can spend days (or weeks) tweaking and experimenting with.
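To make that concrete, here are a few of the knobs engineers typically experiment with; this is an illustrative sketch, and the values are arbitrary examples rather than recommendations:

```python
# Illustrative sketch of common Spark tuning parameters; values are
# arbitrary examples, not recommendations.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.shuffle.partitions", "400")            # shuffle parallelism
    .config("spark.sql.autoBroadcastJoinThreshold", "64MB")   # broadcast join cutoff
    .config("spark.executor.memory", "8g")                    # per-executor memory
    .config("spark.dynamicAllocation.enabled", "true")        # scale executors with load
    .getOrCreate()
)
```

Each of these interacts with data volume and cluster shape, which is why tuning can consume days of engineering time.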
The part that many people, including Databricks' marketing, neglect when making these comparisons is the human cost associated with all this work. In certain cases, it may make sense to pay engineers to experiment with optimizing and tuning a job, but for most ETL workloads the human overhead will often make the total cost higher.
When making any decisions or comparisons related to the cost of each platform, be sure to consider the total costs of ownership from (a) the platform provider and (b) the humans doing the work.
Market Share
Since Databricks is a private company, they don’t disclose their exact number of customers or penetration in each market.
One thing we did discuss in the webinar was how many customers use both platforms. The statistics in the slide below are unverified, but do show a growing overlap between the two platforms.
Jeff and I both speculated that this overlap was due to the historically different focuses of each platform, which have since converged.