
Databricks vs. Snowflake: An Honest Comparison in 2024

  • Ian Whitestone
    Co-founder & CEO of SELECT

Databricks & Snowflake are two of the most popular data cloud platforms in the market right now.

They started out solving very different use cases: Snowflake as the SQL data warehouse and Databricks as a managed Apache Spark service. They were even partners in the early days!

Today, they are both multi-faceted data cloud platforms solving a variety of different use cases. And, as a result, direct competitors.

Back on February 28, 2024, I had a conversation with Jeff Chou from Sync Computing. Jeff’s company works solely with Databricks customers, while we at SELECT work solely with Snowflake customers.

Given that split, we thought it'd be fun to get together and have a genuine conversation about each platform. Neither of us was deeply familiar with the other's platform, but we were both keen to learn from each other.

This was an honest and unscripted conversation between two practitioners. No bullshit benchmarks. No marketing fluff.

We talked about their origin stories, the most common use cases we see from actual customers, their strengths and weaknesses, and where they’re both headed.

Below, I've put together a summary of the key things we discussed.

Origin Stories

Initially, Databricks and Snowflake started as partners, each focusing on a different aspect of data management. While Snowflake specialized in data warehousing, Databricks carved out its niche in managed Spark, then quickly expanded into machine learning (ML) workloads. Interestingly, they used to refer customers to each other.

Fast forward to the present, and both platforms have undergone remarkable transformations. If you look at their websites (snapshotted as of February 27, 2024), Snowflake now calls itself the "data cloud", while Databricks brands itself as the "data intelligence platform":

Snowflake vs. Databricks website branding

At the end of the day, they are both comprehensive, all-in-one data cloud platforms that serve a variety of different data use cases.

With that said, it's still quite interesting to explore their origin stories, as they help explain the relative strengths and weaknesses of each platform today.

Snowflake was founded in 2012 by data warehousing experts from Oracle and another data warehousing company, VectorWise. Snowflake came to market in 2014 with its main data warehousing product, which it often referred to as the "elastic data warehouse" due to its unique architecture that allowed compute and storage to scale independently.

Shortly after Snowflake launched, Databricks launched its first product in a completely different space. Databricks was founded by the creators of Apache Spark, who were researchers working on high-performance computing at UC Berkeley. Their first product was a managed offering of Apache Spark, along with a notebook interface for interactively running jobs on these compute clusters.

Snowflake started expanding by introducing data-sharing functionality in 2017, followed by a marketplace where customers could buy datasets from each other in 2019.

During a similar time frame, Databricks went deeper into the ML space, launching their managed MLflow offering in 2019, followed by MLflow Model Serving in 2020.

Snowflake vs. Databricks product and company timelines

Company Evolution

An interesting thing to observe is how each company has responded to market demands and introduced competing sets of functionality.

Snowflake developed Snowpark, initially aimed at helping customers migrate Spark workloads, which has since evolved into a platform for Python-based ML workloads. They have also invested heavily in support for Apache Iceberg so that customers can manage and leverage their data lakes directly from Snowflake.

Meanwhile, Databricks launched features like Photon and Databricks SQL, expanding its footprint in the data warehousing arena.

This is especially evident today when you compare the interface for creating a virtual warehouse in Snowflake with the one for creating a "SQL warehouse" in Databricks. Databricks has pretty much copied the design and settings of Snowflake's virtual warehouses:

Snowflake vs. Databricks SQL warehouse comparison

While not pictured on the timeline slide above, both companies announced a wave of AI/LLM-related features in late 2023 and early 2024. It's still early days, but both have made large acquisitions in this space and are investing very heavily.

Advantages of Each Platform & Key Differentiators

It’s important to understand how each company started out, as it helps explain the relative strengths and weaknesses of each platform.

Due to its start in data warehousing, Snowflake has a much stronger and more fully featured SQL data warehousing product. For most companies, this will be the most important and most used feature, as most value generated from data strategies will come from a well-managed data warehouse that can serve core business intelligence use cases.

Very few companies I talk to use Databricks as their "data warehouse". Instead, they rely on Databricks for its powerful Python notebooks and strong support for data science workloads. For companies with very technical data engineers who prefer working with Apache Spark and Python, Databricks is often the preferred choice for data transformations.

One advantage of Databricks for ETL use cases is the flexibility and customization Spark offers. For analytical workloads that process extremely large datasets, working with Spark can sometimes be preferred since you can tune more parameters and get the job to run cheaper. In my experience, this only makes sense for workloads costing tens of thousands of dollars per year or more, since the human costs of programming and maintaining these pipelines will often exceed any compute-side savings.

In terms of future roadmaps and product evolution, one of Snowflake's key differentiators is its platform focus. In late 2023, Snowflake released Snowpark Container Services, which allows customers to run containerized applications in Snowflake. Paired with their native application marketplace, it's clear that Snowflake is building for a future where customers and partners can run any type of data application directly in Snowflake.

For Databricks, the approach appears to be giving customers a managed solution for every use case out of the box. Two clear examples are their dashboard functionality and data catalog. With Snowflake, most customers purchase an external BI / dashboarding tool that sits on top of Snowflake, and similarly buy a separate data catalog product to manage and keep track of all their datasets.

Databricks is clearly trying to eliminate the need for customers to buy these separate tools. They acquired Redash in 2020 and have turned it into a strong, out-of-the-box dashboard offering. Similarly, they are investing heavily in Unity Catalog, which aims to replace 3rd party data catalog vendors.

Use Cases & Key Features Comparison

During the webinar, we went through the core use cases of a data cloud platform and then listed the features from both Databricks & Snowflake for each of these use cases. The main use cases discussed were:

  1. Data Ingestion
  2. Data Transformations
  3. Analysis & Reporting
  4. ML/AI
  5. Data Applications
  6. Marketplace
  7. Data Governance & Management

Let’s dive into each of these in more detail.

Data Ingestion

In order to interact with data, it first needs to be loaded or "exposed" to the underlying system. For Snowflake, this usually involves running a COPY INTO command to load the data into a table, which you can then query. Snowflake also offers features like Snowpipe to load data automatically.
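
To make that concrete, here's a minimal sketch of a batch load using Snowflake's Python connector. The connection details, table, and stage names are placeholders, not real objects:

```python
import snowflake.connector

# Placeholder connection details -- substitute your own account info.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="...",
    warehouse="LOAD_WH",
    database="RAW",
    schema="PUBLIC",
)

# Load staged CSV files into a table. "@raw_stage" is a hypothetical
# stage pointing at files in cloud storage.
conn.cursor().execute("""
    COPY INTO raw.public.events
    FROM @raw_stage/events/
    FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
""")
conn.close()
```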

Most Snowflake customers will also typically use a 3rd party solution like Fivetran, Stitch or Airbyte to load data from various sources (application databases, external APIs, etc.) into Snowflake.

With Databricks, most customers instead interact with data directly in cloud storage. That said, managed Volumes are a similar concept to Snowflake tables, where Databricks manages the underlying storage for you.
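
As a rough illustration of the Databricks model, a notebook can query files sitting in cloud storage without any separate load step. The bucket path below is a hypothetical example:

```python
from pyspark.sql import SparkSession

# In a Databricks notebook a SparkSession already exists as `spark`;
# the builder call here just makes the sketch self-contained.
spark = SparkSession.builder.getOrCreate()

# Query raw JSON files directly in cloud storage -- no COPY INTO required.
events = spark.read.json("s3://my-bucket/raw/events/")
events.createOrReplaceTempView("events")
spark.sql("SELECT COUNT(*) AS n FROM events").show()
```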

With Snowflake’s investments in supporting Apache Iceberg, more customers will leave their data directly in cloud storage and interact with it there, similar to the Databricks model.

| Snowflake | Databricks |
| --- | --- |
| Traditional COPY INTO | Autoloader |
| Snowpipe | Native integrations (e.g. S3) |
| First party connectors | Volumes |
| 3rd parties (Fivetran/Stitch/Airbyte) | DBFS |
| No ingestion required if using Iceberg Tables | |

Data Transformations

Once your data is exposed to the cloud platform, you often want to transform or enrich it in some way. Both platforms provide a variety of different solutions for doing this.

With Snowflake being a SQL-based data warehouse, most customers do their data transformations in pure SQL using a combination of tasks, stored procedures, or 3rd party transformation and orchestration tools like dbt. All SQL workloads run in Snowflake’s virtual warehouses.

On the Databricks side, most customers use Jobs, which let you submit a Spark job to a cluster running on compute instances in your cloud account. With Databricks' recent investments in their serverless SQL warehouse product, it is becoming more common to see pure SQL data transformations with tools like dbt.
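
For example, a typical Spark transformation submitted as a Databricks Job might look like this PySpark sketch (the source and target table names are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Roll raw orders up into a daily revenue table.
# "raw.orders" and "analytics.daily_revenue" are hypothetical names.
orders = spark.read.table("raw.orders")

daily_revenue = (
    orders
    .where(F.col("status") == "complete")
    .groupBy(F.to_date("created_at").alias("order_date"))
    .agg(F.sum("amount").alias("revenue"))
)

daily_revenue.write.mode("overwrite").saveAsTable("analytics.daily_revenue")
```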

| Snowflake | Databricks |
| --- | --- |
| Tasks for scheduling | Jobs |
| Stored Procedures | Tasks in Workflows |
| Dynamic Tables / Materialized Views | Delta Live Tables (declarative ETL) |
| 3rd parties (dbt/Airflow/Dagster/Prefect) | SQL Warehouses |
| | 3rd parties (dbt/Airflow/Dagster/Prefect) |

Analysis & Reporting

Both Databricks & Snowflake provide their customers with a number of features to do analysis and reporting. Snowflake allows you to create lightweight dashboards directly in Snowsight, or you can build custom data apps using Streamlit.
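
As a sketch of how lightweight this can be, a Streamlit in Snowflake app only needs a few lines. `get_active_session` is the standard entry point inside Snowflake; the table being queried is a hypothetical example, and the chart call assumes a recent Streamlit version:

```python
import streamlit as st
from snowflake.snowpark.context import get_active_session

# Inside Streamlit in Snowflake, an authenticated Snowpark session
# is already available -- no credentials needed in the app code.
session = get_active_session()

st.title("Daily Revenue")

# "analytics.daily_revenue" is a hypothetical table.
df = session.sql(
    "SELECT order_date, revenue FROM analytics.daily_revenue ORDER BY order_date"
).to_pandas()

st.line_chart(df, x="ORDER_DATE", y="REVENUE")
```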

Databricks has a very well-built dashboarding product that some companies use in place of a 3rd party BI tool.

| Snowflake | Databricks |
| --- | --- |
| Snowsight Dashboards | Notebook plots |
| Streamlit | SQL Visualizations |
| 3rd parties (Tableau, Looker, PowerBI, etc.) | Dashboards |
| | 3rd parties (Tableau, Looker, PowerBI, etc.) |

ML/AI

As mentioned earlier, both companies are investing heavily in ML and AI capabilities. Due to its earlier start in this area, Databricks has more well-developed ML features like managed MLflow and Model Serving.
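
For reference, the core tracking workflow that managed MLflow wraps looks roughly like this (the model and dataset are just a toy example):

```python
import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Log parameters, metrics, and the trained model to an MLflow run.
# On Databricks, runs land in the managed tracking server automatically.
with mlflow.start_run():
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.log_param("max_iter", 200)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")
```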

With the launch of Snowpark Container Services, I expect many Snowflake customers will quickly be able to start hosting ML models directly in Snowflake.

| Snowflake | Databricks |
| --- | --- |
| Snowpark | MLflow |
| Snowpark Container Services | Model Serving |
| Snowflake Cortex | Strong Python support |

Data Applications

An interesting angle for comparing Snowflake and Databricks is building "data applications". This term is admittedly broad and open to interpretation, so I'll define a "data application" as a product or feature used to serve live data or insights to customers outside the company. In other words, not an application used internally within the company.

Due to its high-performance SQL data warehouse, many companies (like SELECT) build their applications directly on top of Snowflake and serve application queries straight from Snowflake virtual warehouses. You can see more examples of this in Snowflake’s Powered By program. With new features like Container Services, it will be possible to host full web applications directly in Snowflake.
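
As a minimal sketch of the pattern, an application backend can serve a live, parameterized query straight from a virtual warehouse. The table and columns here are hypothetical:

```python
import snowflake.connector

def get_customer_metrics(conn, customer_id: str):
    # Bind the customer ID as a parameter rather than interpolating it,
    # since this query is driven by external user input.
    cur = conn.cursor()
    cur.execute(
        "SELECT metric_name, metric_value "
        "FROM app.customer_metrics "  # hypothetical table
        "WHERE customer_id = %s",
        (customer_id,),
    )
    return cur.fetchall()
```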

For Databricks, the main use case for "external data applications" comes from their model serving features, although similar SQL query serving should soon become possible given the investments they are making in their data warehousing products.

| Snowflake | Databricks |
| --- | --- |
| Serving apps from Snowflake | Model serving |
| Unistore (HTAP) - hybrid tables | Triggering Jobs on the fly |
| Data Sharing | Serverless SQL |
| Container Services | |

Marketplace

As a customer, you often want to buy additional applications or datasets to use in your data cloud platform. Snowflake is a clear winner here with a very mature marketplace filled with both datasets and native applications you can run directly in your Snowflake account.

| Snowflake | Databricks |
| --- | --- |
| Very mature marketplace | Data marketplace about 1 year old |
| Easily buy thousands of datasets | Technology partners |
| Native Apps | Much less mature, less of a priority |
| Huge focus on partners | |

Data Governance & Management

On the governance and management side of things, both platforms offer features out of the box.

Snowflake makes hundreds of metadata datasets available to all customers for free in the Snowflake account usage schema. They have a very advanced cost management suite, including powerful features like budgets and resource monitors. They've also recently announced Snowflake Horizon, a new set of capabilities to help you govern your data assets and users.
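
For example, credits consumed per warehouse over the last 30 days can be pulled with a single query against that share (the 30-day window is an arbitrary choice):

```python
# Credits used per warehouse over the last 30 days, from the free
# ACCOUNT_USAGE share. Run via e.g. conn.cursor().execute(QUERY).
QUERY = """
SELECT warehouse_name, SUM(credits_used) AS credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY 1
ORDER BY 2 DESC
"""
```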

Databricks has a very strong data catalog offering with their Unity Catalog product, which helps customers manage and understand all the data in their environment. Databricks is much further behind on the cost management side of things, and only recently made this data accessible in system tables (their equivalent of Snowflake’s account usage views).
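
The rough Databricks equivalent, queried from a notebook against the billing system table. This assumes system tables are enabled in your workspace, and the column names reflect the early-2024 schema, so verify them before relying on this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# DBUs consumed per SKU over the last 30 days. Column names are based
# on the early-2024 system table schema -- verify in your workspace.
spark.sql("""
    SELECT sku_name, SUM(usage_quantity) AS dbus
    FROM system.billing.usage
    WHERE usage_date >= date_sub(current_date(), 30)
    GROUP BY sku_name
    ORDER BY dbus DESC
""").show()
```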

| Snowflake | Databricks |
| --- | --- |
| Hundreds of metadata datasets (account usage / information schema) | Unity Catalog |
| Snowflake Horizon | System tables |
| Cost Management Suite | Compute metrics |
| | No visibility into cloud costs, just Databricks' costs |

Pricing and Cost

Both Databricks and Snowflake offer usage-based pricing where you pay for what you use. To learn more about Snowflake's pricing model, you can read our post here. Databricks' pricing information can be found on their website. A very important thing to note with Databricks' pricing is that there are two sets of charges:

  1. The overhead/platform charges from Databricks
  2. The underlying cloud costs from AWS/Azure/GCP for the servers that Databricks spins up in your cloud account
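
To make those two line items concrete, here's a back-of-envelope calculation. All the rates below are purely illustrative placeholders, not quoted prices:

```python
# Hypothetical job: 10 nodes running for 2 hours.
nodes, hours = 10, 2

# Illustrative rates only -- check current Databricks and cloud pricing.
dbu_rate = 0.15           # $ per DBU (the Databricks platform charge)
dbus_per_node_hour = 1.0  # DBUs emitted per node-hour for this instance type
instance_rate = 0.77      # $ per node-hour paid to the cloud provider

databricks_charge = nodes * hours * dbus_per_node_hour * dbu_rate
cloud_charge = nodes * hours * instance_rate

print(f"Databricks platform charge: ${databricks_charge:.2f}")  # $3.00
print(f"Cloud provider charge:      ${cloud_charge:.2f}")       # $15.40
print(f"Total:                      ${databricks_charge + cloud_charge:.2f}")
```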

Like any usage-based cloud platform, costs can quickly skyrocket if not managed or monitored appropriately.

Is Databricks cheaper than Snowflake?

A common question many people ask is whether Databricks is cheaper than Snowflake, driven in part by a heavy marketing effort from Databricks, pictured below from their website:

Snowflake vs. Databricks pricing

When considering the cost of any data process or application, there are two important factors to consider:

  1. The platform costs. The money you pay Databricks/Snowflake/your cloud provider.
  2. The human costs. The money you pay your employees to build and maintain the applications and processes they create.

Databricks claims that ETL workloads can be run much cheaper than Snowflake. This claim originates from the fact that Spark jobs can be heavily tuned. There are a ton of different parameters engineers can spend days (or weeks) tweaking and experimenting with.
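
As an illustration of how large that tuning surface is, here are just a few of the knobs an engineer might experiment with on a single job. The values are arbitrary starting points, not recommendations:

```python
from pyspark.sql import SparkSession

# A handful of the many Spark parameters that affect cost and runtime.
spark = (
    SparkSession.builder
    .config("spark.sql.shuffle.partitions", "400")      # shuffle parallelism
    .config("spark.sql.adaptive.enabled", "true")       # adaptive query execution
    .config("spark.executor.memory", "8g")              # memory per executor
    .config("spark.dynamicAllocation.enabled", "true")  # scale executors with load
    .getOrCreate()
)
```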

The part that many people, including Databricks' marketing, neglect in these comparisons is the human cost associated with all this work. In certain cases it may make sense to pay engineers to experiment with optimizing and tuning a job, but for most ETL workloads the human overhead will often make the total cost higher.

When making any decisions or comparisons related to the cost of each platform, be sure to consider the total costs of ownership from (a) the platform provider and (b) the humans doing the work.

Market Share

Since Databricks is a private company, they don’t disclose their exact number of customers or penetration in each market.

One thing we did discuss in the webinar was how many customers use both platforms. The statistics in the slide below are unverified, but do show a growing overlap between the two platforms.

Jeff and I both speculated that this overlap was due to the historically different focuses of each platform, which have since converged.

Snowflake vs. Databricks market share
