Snowflake Tasks are a powerful feature, commonly used for data pipeline use cases or for orchestrating administrative jobs (think backups, data quality checks) and alerts. In this definitive guide, I cover everything you need to know about Snowflake Tasks, from task setup and creation to the different compute models, task management, and observability.
What are Snowflake Tasks?
Snowflake Tasks let you run SQL on a schedule. The SQL command can be anything: a single SQL statement, or a call to a stored procedure which invokes multiple SQL statements. Tasks are first-class citizens in Snowflake, meaning they have an owner and you can control access to them by granting privileges, just like any other Snowflake object.
How to create a Snowflake Task
Similar to other Snowflake objects, Tasks can be created and managed programmatically with SQL. The most common parameters when creating a new task are:
schedule: when the Task should be triggered
warehouse: what compute cluster should be used
code: the SQL command to run
condition: a boolean expression that gets evaluated when the Task is triggered. If the condition is not met, the Task run is skipped.
For a full list of parameters and settings associated with tasks, see the associated Snowflake documentation.
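As a quick illustration, here is a sketch of a Task definition using all four of these parameters. The condition in the WHEN clause is limited to the SYSTEM$STREAM_HAS_DATA function; the task, stream, and table names below are hypothetical:

```sql
-- Hypothetical example: triggered every 5 minutes, but the run is
-- skipped unless the stream my_stream contains new rows.
create task t_load_changes
    warehouse = compute_wh                       -- compute cluster to use
    schedule = '5 minute'                        -- interval-based schedule
    when system$stream_has_data('my_stream')     -- condition; run is skipped if false
as
    insert into my_target select * from my_stream;
```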
Task Creation Example
One of the key parameters you need to define when you create a Task is the schedule parameter. This dictates when the Task will be triggered. The schedule can be defined in several ways: you can specify it using CRON syntax, define an interval in minutes (e.g. run every 5 minutes), or specify that the Task should run after another Task.
Let’s create a simple Task to refresh a dimensional model containing bike stations. We'll schedule this Task to run every night at 4 am:
create or replace task t_dim_station
    warehouse = compute_wh
    schedule = 'USING CRON 0 4 * * * UTC'
as
merge into dim_station t using
(
    select distinct
        start_station_id station_id,
        start_station_name station_name,
        start_station_latitude station_latitude,
        start_station_longitude station_longitude
    from raw_trips
    union
    select distinct
        end_station_id station_id,
        end_station_name station_name,
        end_station_latitude station_latitude,
        end_station_longitude station_longitude
    from raw_trips
) s
on t.station_id = s.station_id
when matched then update set t.station_name = s.station_name
when not matched then
    insert (station_id, station_name, station_longitude, station_latitude)
    values (s.station_id, s.station_name, s.station_longitude, s.station_latitude);
This example uses the CRON syntax to schedule the Task. It might be a little tricky to read the CRON syntax if you are not used to it. I recommend using sites like crontab which can generate the right syntax for you along with an explanation.
Tasks always start out as suspended!
Whenever you create a task, it will start out in a suspended state, meaning that it is not scheduled and you have to resume it first. You can do so with the following command:
alter task t_dim_station resume;
Whenever you update the Task definition (i.e. run an alter task command), it will become suspended and needs to be resumed! In my experience, this is often the root cause of issues related to data pipelines not running. When debugging Tasks, always verify that the Task is resumed. You can get the Task state from the output of the show tasks command.
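For example, a quick way to check the state (the database and schema names here are placeholders; adjust them to your environment):

```sql
-- The "state" column in the output shows whether the task
-- is 'started' or 'suspended'.
show tasks like 't_dim_station' in schema my_db.my_schema;
```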
Tasks can be chained together to create more complex data pipelines consisting of multiple tasks. Data practitioners often refer to this chain of dependencies as "DAGs" - Directed Acyclic Graphs, an example of which is shown below.
In the case of DAGs, there are certain rules and limitations which need to be followed:
- A DAG needs to have a single root task
- Only a single forward direction is supported. Your DAG cannot have loops.
- A single DAG can have up to 1000 Tasks in total
- A single Task can have up to 100 predecessors and up to 100 child tasks
Let’s add one more task into our pipeline and create a DAG. Our current task t_dim_station refreshes the dimension table from a raw table. We are going to add one predecessor task which will populate the raw table from an external stage. We'll also update the schedule of t_dim_station: instead of being scheduled to run at 4 AM UTC, we will add an AFTER condition so the Task is triggered once the predecessor Task successfully finishes.
First, we'll define the new predecessor Task:
create task t_raw_trips
    warehouse = compute_wh
    schedule = 'using cron 0 4 * * * utc'
as
copy into raw_trips from @s3_source_data
on_error = skip_file;
Now we have to modify our initial task. First we have to remove the SCHEDULE parameter and then add the AFTER dependency:
alter task t_dim_station unset schedule;
alter task t_dim_station add after t_raw_trips;
We have now successfully formed a DAG consisting of two Tasks, where t_raw_trips is the "root task".
If we run the show tasks command again, we can notice the following changes:
- the original task now has no schedule
- the original task now has a predecessor task
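Remember that both Tasks are suspended at this point. Rather than resuming each Task individually (child tasks first, then the root), you can use the SYSTEM$TASK_DEPENDENTS_ENABLE function to resume the root task together with all of its dependents in one call:

```sql
-- Resumes t_raw_trips and every task that depends on it
-- (in our example: t_dim_station).
select system$task_dependents_enable('t_raw_trips');
```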
Snowflake also provides a great UI for working with tasks, especially DAGs. You can visualize the whole pipeline from the "Graph" tab, go into task details or see the run history.
Task Compute Models
Self managed warehouses
Both our example Tasks from above were configured to run on a self-managed warehouse, COMPUTE_WH. This means the SQL commands run by the Tasks will execute on our own COMPUTE_WH, and we are responsible for managing that warehouse (choosing the right size, cluster count, auto-suspend, etc.).
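For Task workloads on a self-managed warehouse, an aggressive auto-suspend setting helps avoid paying for idle time between runs. A quick sketch (the 60-second value is just an example, not a recommendation for every workload):

```sql
-- Suspend the warehouse shortly after the task finishes, and let
-- scheduled tasks wake it up again automatically.
alter warehouse compute_wh set
    auto_suspend = 60      -- seconds of inactivity before suspending
    auto_resume = true;
```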
An alternative approach is to use Serverless Tasks. Rather than running on one of your own virtual warehouses, Serverless Tasks run on compute resources provided by Snowflake. To create a serverless Task, you just omit the warehouse parameter when creating a new task.
One advantage of Serverless Tasks is that Snowflake will automatically determine the optimal warehouse size for the Task. There is a parameter named USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE which defines the warehouse size for the first Task runs. After a few runs, Snowflake ignores that parameter and assigns the warehouse size based on insights from successfully completed runs. It learns over time what the best warehouse size for a given task is and dynamically changes the size as required.
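A sketch of creating a serverless Task follows; note there is no warehouse parameter, and the initial size only influences the first few runs (the task and table names are hypothetical):

```sql
create task t_serverless_example
    user_task_managed_initial_warehouse_size = 'XSMALL'  -- hint for the first runs only
    schedule = 'USING CRON 0 4 * * * UTC'
as
    insert into my_table select current_timestamp;
```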
If you have an existing Task that runs on one of your own warehouses, you can convert it to a Serverless Task by running an ALTER TASK statement:
alter task t_dim_station unset warehouse;
When to use Serverless Tasks versus a managed warehouse?
According to the Snowflake Credit Consumption Table, Serverless Tasks cost 1.5x more than Tasks running on your own managed warehouse.
With some math, we can work out that the serverless compute option is better for Tasks that take less than 40 seconds to run: a self-managed warehouse bills a minimum of 60 seconds each time it resumes, so a short run still costs 60 seconds of compute, while the serverless equivalent costs 1.5x the actual run time, making 40 seconds (1.5 × 40 = 60) the break-even point. For longer-running Tasks, the managed warehouse option will be more cost effective.
Serverless Tasks have the benefit of being able to automatically choose the right amount of compute resources for your Task, which could lead to cost & performance benefits if you choose a warehouse size that is too small for your Task.
When it comes to managing Snowflake Tasks, there are several things you'll want to understand:
- How often is the task running?
- How long does it take to run?
- How often is it failing?
- Can we trigger alerts when the Tasks fail?
Using the Snowflake UI
Let's explore how we can use the Snowflake UI to manage Tasks.
To locate your Tasks, navigate to the database and schema where you have created them.
The Task "Details" tab gives you an overview of the whole task. You can see the definition together with key parameters like the assigned warehouse, schedule, and granted privileges. The "Graph" tab then gives you view on the overall DAG like we have seen above.
If you need to check historical runs, you can open the "Run History" tab. The Task history is visualised as a bar chart showing the number of runs over time. Individual runs are shown in the table below it. Note that this page only contains run data from the last 7 days.
If you want to understand and monitor Tasks across your whole account, you can navigate to Activity -> Task History.
From here, you can open individual graph runs and check the status of each Task & run. This account level graph history UI is currently in Public Preview.
Programmatically Monitoring Tasks
For folks who prefer code over Snowflake's UI, or who would like to build their own visualisations around Tasks, Snowflake offers a number of views which provide data about Task runs, statuses, and more. Let’s go through a few examples.
We have already used the show tasks command, which gives you a basic overview of the Tasks in your account. It provides information like Task state, definition, predecessors, warehouse, and much more.
If you need to go through Task history with SQL, you can either use the TASK_HISTORY() table function from the INFORMATION_SCHEMA, or the TASK_HISTORY view in the ACCOUNT_USAGE schema of the shared SNOWFLAKE database. The table function returns task runs from the last 7 days and has no latency, which makes it a perfect source for custom monitoring of current runs. On the other hand, the view in ACCOUNT_USAGE has latency of up to 45 minutes and contains data up to 365 days back. That’s why it is better suited for historical analysis of Task runs, rather than exploring current runs.
Here's an example query using the Task history table function from the information schema to show all Tasks which failed in the given time range.
select *
from table(information_schema.task_history(
    scheduled_time_range_start => to_timestamp_ltz('2023-10-18 00:00:00.000'),
    error_only => TRUE))
order by scheduled_time desc;
You can see that the output includes information about the error together with the query_id. If you want, you can go into the Query History details or the query profile to find out more about that failure. If you query the table function without any parameters, you will get all the task executions from the last 7 days together with scheduled Task runs.
select *
from table(information_schema.task_history())
order by scheduled_time desc;
This approach works well when your Task triggers just a single SQL statement. But what if you have a Task which runs a stored procedure, which in turn triggers multiple statements? How can we go about identifying all the associated statements?
To accomplish this, we need to leverage the SESSION_ID attribute in the QUERY_HISTORY view. All queries triggered by a stored procedure will most likely run within the same session. So what you need to do is first join TASK_HISTORY with QUERY_HISTORY to find out the SESSION_ID of the stored procedure call:
select
    q.session_id,
    t.name,
    t.query_text,
    t.query_start_time,
    t.completed_time
from table(information_schema.task_history(task_name => 't_dim_station')) t
inner join snowflake.account_usage.query_history q
    on t.query_id = q.query_id;
Then you can run another query to find all the queries with the same SESSION_ID:
select *
from snowflake.account_usage.query_history
where session_id = <<session id from previous step>>
    and start_time >= '<<task start time>>'
    and start_time <= '<<task end time>>';
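The two steps can also be combined into a single query. This is a sketch that assumes you are interested in the Task's most recent run; it reuses the t_dim_station Task from earlier:

```sql
-- Find all statements that ran in the same session as the stored
-- procedure call triggered by the task's latest run.
with proc_call as (
    select
        q.session_id,
        t.query_start_time,
        t.completed_time
    from table(information_schema.task_history(task_name => 't_dim_station')) t
    inner join snowflake.account_usage.query_history q
        on t.query_id = q.query_id
    order by t.query_start_time desc
    limit 1
)
select h.query_id, h.query_text, h.start_time
from snowflake.account_usage.query_history h
inner join proc_call p
    on h.session_id = p.session_id
where h.start_time between p.query_start_time and p.completed_time
order by h.start_time;
```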
Getting alerts for Task failures
Getting automatically notified when any Tasks fail is a critical requirement for most Task deployments.
Snowflake offers different features related to alerting and notifications, which can be used for Tasks. I have written separate blogs covering these options in more detail. You can check them out using the links below:
- A deep dive into Snowflake's alerting and notification features
- Error Notifications for Snowflake Tasks
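As a teaser, a Task can push failure notifications to a cloud messaging queue via the ERROR_INTEGRATION parameter. A brief sketch, assuming a notification integration named my_error_notification_int already exists (setting one up is cloud-provider specific and covered in the posts above):

```sql
-- Attach an existing notification integration to the task so that
-- Snowflake sends a message whenever a run fails. The task must be
-- suspended before it can be altered, and resumed afterwards.
alter task t_dim_station suspend;
alter task t_dim_station set error_integration = my_error_notification_int;
alter task t_dim_station resume;
```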
We touched on some Task limitations when discussing DAGs earlier in the post. Here's a summary of all the limitations related to Tasks:
- Your Task can run a single SQL statement or stored procedure call
- A single task can have up to 100 predecessors
- A single task can have up to 100 child tasks
- A single DAG can have up to 1000 tasks
- A DAG needs to have a single root task
- DAGs do not support loops
- Tasks can’t be shared through Snowflake's data sharing capabilities
How are Tasks billed?
There are no extra charges or overhead fees for using Tasks. Users are only billed for the compute resources used to execute the SQL statements associated with each Task. If your Tasks run on your own managed warehouses, you will be billed for those compute resources in the same way you are charged for regular SQL queries running on a warehouse. If you are using Serverless Tasks, you are billed for the compute resources managed by Snowflake. As a reminder, Serverless Task compute is 1.5x more expensive than the equivalent compute on your own virtual warehouses.
Monitoring Serverless Task Costs
Serverless Task costs can be monitored using the serverless_task_history view from the account_usage schema. Here's an example query:
select
    start_time,
    end_time,
    task_name,
    credits_used
from snowflake.account_usage.serverless_task_history
where
    start_time > current_date - 30
order by start_time desc;