60x faster database clones in Snowflake
- Niall Woodward, Co-founder & CTO of SELECT
Snowflake's zero-copy cloning feature is extremely powerful for quickly creating production replica environments. But anyone who has cloned a database or schema containing a large number of tables knows it can take over ten minutes to complete. In this post we explore a potential solution.
Introduction
I had the pleasure of attending dbt’s Coalesce conference in London last week, and dropped into a really great talk by Felipe Leite and Stephen Pastan of Miro. They mentioned how they’d achieved a considerable speed improvement by switching database clones out for multiple table clones. I had to check it out.
Experiments
Results were collected using the following query, which measures the duration of each process by passing the earliest query start time and the latest query end time to the DATEDIFF function:
```sql
select
    count(*) as query_count,
    datediff(seconds, min(start_time), max(end_time)) as duration,
    sum(credits_used_cloud_services) as credits_used_cloud_services
from snowflake.account_usage.query_history
where query_tag = X;
```
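Each experiment below sets a distinct QUERY_TAG on its connection via the connector's session_parameters, and that tag is what replaces the X placeholder in the filter above. The same tag can also be set in plain SQL if you're running statements from a worksheet; a minimal sketch (the tag value here is illustrative):
```sql
alter session set query_tag = 'test 2';
```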
Setup
Create a database with 10 schemas and 100 tables in each:
```python
import snowflake.connector

con = snowflake.connector.connect(
    ...
)

# Create the source database (assumed not to exist yet), then populate it
con.cursor().execute("create database test;")
for i in range(1, 11):
    con.cursor().execute(f"create schema test.schema_{i};")
    for j in range(1, 101):
        con.cursor().execute(f"create table test.schema_{i}.table_{j} (i number) as (select 1);")
```
Control - Database clone
```sql
create database test_1 clone test;
```
This operation took 22m 34s to execute.
Results:
| Query count | Duration | Cloud services credits |
|---|---|---|
| 1 | 22m 34s | 0.179 |
Experiment 1 - Schema level clones
```python
import snowflake.connector
from snowflake.connector import DictCursor

def clone_database_by_schema(con, source_database, target_database):
    con.cursor().execute(f"create database {target_database};")
    cursor = con.cursor(DictCursor)
    cursor.execute(f"show schemas in database {source_database};")
    for i in cursor.fetchall():
        if i["name"] not in ("INFORMATION_SCHEMA", "PUBLIC"):
            con.cursor().execute_async(f"create schema {target_database}.{i['name']} clone {source_database}.{i['name']};")

con = snowflake.connector.connect(
    ...
    session_parameters={
        'QUERY_TAG': 'test 2',
    }
)

clone_database_by_schema(con, "test", "test_2")
```
Results:
| Query count | Duration | Cloud services credits |
|---|---|---|
| 12 | 1m 47s | 0.148 |
Using `execute_async` submits each SQL statement without waiting for it to complete, resulting in all 10 schemas being cloned concurrently. A whopping 10x faster from start to finish compared with the regular database clone.
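One caveat: `execute_async` returns as soon as a statement is submitted, so the script can finish before the clones do (the durations above are measured from query_history, not the script). If downstream steps depend on the clones existing, you can poll for completion using the connector's query status APIs. A minimal sketch, where the `wait_for_queries` helper is my own illustration rather than part of the experiment:
```python
import time

import snowflake.connector

def wait_for_queries(con, query_ids):
    """Poll until every async query finishes; raises if any query errored."""
    for query_id in query_ids:
        while con.is_still_running(con.get_query_status_throw_if_error(query_id)):
            time.sleep(0.5)

con = snowflake.connector.connect(
    ...
)

# Submit the clones asynchronously, keeping each statement's query ID
# (assumes the target database test_2 has already been created)
query_ids = []
for i in range(1, 11):
    cursor = con.cursor()
    cursor.execute_async(f"create schema test_2.schema_{i} clone test.schema_{i};")
    query_ids.append(cursor.sfqid)

wait_for_queries(con, query_ids)
```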
Experiment 2 - Table level clones
```python
import snowflake.connector
from snowflake.connector import DictCursor

def clone_database_by_table(con, source_database, target_database):
    con.cursor().execute(f"create database {target_database};")
    cursor = con.cursor(DictCursor)
    cursor.execute(f"show tables in database {source_database};")
    results = cursor.fetchall()
    schemas_to_create = {r['schema_name'] for r in results}
    tables_to_clone = [f"{r['schema_name']}.{r['name']}" for r in results]
    for schema in schemas_to_create:
        con.cursor().execute(f"create schema {target_database}.{schema};")
    for table in tables_to_clone:
        con.cursor().execute_async(f"create table {target_database}.{table} clone {source_database}.{table};")

con = snowflake.connector.connect(
    ...
    session_parameters={
        'QUERY_TAG': 'test 3',
    },
)

clone_database_by_table(con, "test", "test_3")
```
This took 1m 48s to complete, the limiting factor being the rate at which the client could dispatch the queries (likely due to network waiting times). To help mitigate that, I distributed the commands across 10 threads:
```python
import threading

import snowflake.connector
from snowflake.connector import DictCursor

class ThreadedRunCommands:
    """Helper class for running queries across a configurable number of threads."""

    def __init__(self, con, threads):
        self.threads = threads
        self.register_command_thread = 0
        self.thread_commands = [[] for _ in range(self.threads)]
        self.con = con

    def register_command(self, command):
        # Assign commands to threads round-robin so the load is spread evenly
        self.thread_commands[self.register_command_thread].append(command)
        if self.register_command_thread + 1 == self.threads:
            self.register_command_thread = 0
        else:
            self.register_command_thread += 1

    def run_command(self, command):
        self.con.cursor().execute_async(command)

    def run_commands(self, commands):
        for command in commands:
            self.run_command(command)

    def run(self):
        threads = []
        for v in self.thread_commands:
            thread = threading.Thread(target=self.run_commands, args=(v,))
            threads.append(thread)
            thread.start()
        # Wait for all threads to finish dispatching their queries
        for thread in threads:
            thread.join()

def clone_database_by_table(con, source_database, target_database):
    con.cursor().execute(f"create database {target_database};")
    cursor = con.cursor(DictCursor)
    cursor.execute(f"show tables in database {source_database};")
    results = cursor.fetchall()
    schemas_to_create = {r['schema_name'] for r in results}
    tables_to_clone = [f"{r['schema_name']}.{r['name']}" for r in results]
    for schema in schemas_to_create:
        con.cursor().execute(f"create schema {target_database}.{schema};")
    threaded_run_commands = ThreadedRunCommands(con, 10)
    for table in tables_to_clone:
        threaded_run_commands.register_command(f"create table {target_database}.{table} clone {source_database}.{table};")
    threaded_run_commands.run()

con = snowflake.connector.connect(
    ...
    session_parameters={
        'QUERY_TAG': 'test 4',
    }
)

clone_database_by_table(con, "test", "test_4")
```
Results:
| Query count | Duration | Cloud services credits |
|---|---|---|
| 1012 | 22s | 0.165 |
Using 10 threads, the time between the `create database` command starting and the final `create table ... clone` command completing was only 22 seconds. This is 60x faster than the `create database ... clone` command. The bottleneck is still the rate at which queries can be dispatched.
In Summary
The complete results:
| Clone strategy | Query count | End-to-end duration | Cloud services credits |
|---|---|---|---|
| Control - Database clone | 1 | 22m 34s | 0.179 |
| Experiment 1 - Schema level clones | 12 | 1m 47s | 0.148 |
| Experiment 2 - Table level clones | 1012 | 22s | 0.165 |
All the queries run were processed entirely by Snowflake's cloud services layer, so they didn't require a running warehouse or cause a suspended one to resume.
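You can sanity-check this in query_history too: assuming warehouse_size is only populated for queries that actually ran on a warehouse, it should be null for every tagged query (shown here for experiment 2's tag):
```sql
select
    warehouse_size,  -- expect null: these queries never ran on a warehouse
    count(*) as query_count,
    sum(credits_used_cloud_services) as cloud_services_credits
from snowflake.account_usage.query_history
where query_tag = 'test 4'
group by 1;
```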
I hope that Snowflake improves their schema and database clone functionality, but in the meantime, cloning tables seems to be the way to go.
Thanks again to Felipe Leite and Stephen Pastan of Miro for sharing this!