Custom Workloads
What are custom workloads?
Many Snowflake users leverage the platform for unique workloads, such as custom ETL pipelines, machine learning jobs, or analytics applications. Users need to monitor these workloads to understand their cost and performance over time. By adding query comments or query tags to the queries run by these workloads, SELECT users can benefit from enhanced monitoring capabilities compared to the default Query Patterns workload type.
Feature Overview
The SELECT custom workload integration allows you to better understand and optimize Snowflake spend associated with these unique workloads.
- Quickly identify expensive workloads by name or application (instead of by Query Pattern), and understand how their cost and performance vary over time
- Easily dissect your Snowflake compute spend across a variety of different dimensions such as environment, team and tenant_id
- Drill into the execution history of any workload and view query profile diagnostics for each run
- Receive tailored optimization recommendations to improve workload cost and performance
The Custom Workloads feature works by adding metadata to your queries via SQL comments or Snowflake query tags. SELECT will automatically detect this metadata and intuitively surface these workloads in the UI.
Use custom workloads to ensure consistent tracking of costs
By default, SELECT categorizes compute spend into Query Patterns based on the SQL fingerprint of the query. When users make significant changes to the structure of the SQL (adding new fields, filter conditions, or joins to new datasets), that SQL becomes part of a new Query Pattern. This makes it more difficult to understand the impact of those changes on cost and performance, as you'll have to compare two different Query Patterns.
With custom workloads, queries are grouped into a workload based on the query tag or comment. No matter what changes are made to the SQL, the queries will consistently be associated with the same workload specified in the query metadata, allowing you to confidently assess the impact of any changes made.
Integrating with custom workloads
To support custom workload monitoring (i.e. queries run by custom tooling), SELECT utilizes query metadata. The query metadata is a JSON object, and can be set either in the query's tag or in a comment. The schema is as follows:
{
# [REQUIRED]
"select_schema_version": "1.0.0",
# [REQUIRED]
# identifiers
"app": "generated_tables", # first grouping category
"workload_id": "my_etl_job", # secondary grouping category
# [OPTIONAL]
# If a workload contains multiple queries, include a run_id to group them together
# If run_id isn't present, each query is considered an individual run
# This value might be a date but can be anything
"run_id": "2020-01-01",
# [OPTIONAL]
# If specified, these will automatically appear in the UI as options you can filter & slice with
"environment": "dev",
"team": "data",
# [OPTIONAL]
# Custom tags - you can put whatever you want in here!
# These fields can be used for slicing and filtering in the SELECT UI
# after you've specified them in the UI under Settings -> Custom
"tenant_id": "customer 1",
"creator": "Hermione Granger",
"description": "Processes the data",
"link": "https://airflow.org.net/dag/my_etl_job",
"git_sha": "abc123",
# [OPTIONAL]
# You can add complex data types, but we currently don't support using these in the UI
"tags": ["critical", "sensitive"],
"meta": {"task_key": "A1SF249X", "attempt_number": 1}
}
Example query comments
Both multi-line and single-line comment syntax are supported. If using single-line comments, ensure that the final character in the comment is a closing curly brace and not a semi-colon.
Multi-line comment:
select * from my_table
/*
{
"select_schema_version": "1.0.0",
"app": "ingest",
"workload_id": "facebook_ads"
}
*/
Single-line comment:
select * from my_table
-- {"select_schema_version": "1.0.0", "app": "ingest", "workload_id": "facebook_ads"}
Remove trailing semicolons
Your SQL statement must not have a semicolon before the query comment, or else the comment will not be sent to Snowflake:
select * from my_table; -- REMOVE THIS SEMICOLON
-- {"select_schema_version": "1.0.0", "app": "ingest", "workload_id": "facebook_ads"}
Example query tag
Alternatively, query tags can be used. The downside to using query tags is the need to both set and unset the query tag in the session, adding latency and additional complexity.
{"select_schema_version": "1.0.0", "app": "ingest", "workload_id": "facebook_ads"}
Example Python implementation using query comments
Here's an example implementation using the Snowflake Python Connector:
import json
import snowflake.connector
# Connect to Snowflake & create a cursor
conn = snowflake.connector.connect(
    user='<your_username>',
    password='<your_password>',
    account='<your_account>',
)
cursor = conn.cursor()
# Generate query comment
select_query_comment = {
    "select_schema_version": "1.0.0",
    "app": "data_transformations",
    "workload_id": "customers",
    "environment": "prod"
}
select_query_comment_str = f"\n\n/* {json.dumps(select_query_comment)} */"
# Execute query with comment
query = """
create or replace table db.analytics.customers as (
    select *
    from raw_customers
    where
        not deleted
        and created_at > '2019-01-01'
)
"""
cursor.execute(query + select_query_comment_str)
How query metadata is reflected in the SELECT UI
app and workload_id
The most important fields to set in the query comment/tag metadata are the app and workload_id. By default, they will show up in the UI in the main table as shown below.
The example query below would be associated with the first row shown in the table above, regardless of which user or warehouse it ran on.
select * from my_table
-- {"select_schema_version": "1.0.0", "app": "SELECT_WEB_APP", "workload_id": "DbtQueriesProcessedQuery"}
When to use run_id
Sometimes a single workload will have multiple steps associated with it. For example, a data transformation workload may run 3 queries every time it runs:
- A query to create a temporary table
- A query to merge data from the temporary table into the main table
- A final query to delete the temporary table
To help track workloads like this, SELECT's custom workloads feature has first-class support for a field called run_id. An example workload is shown below which issues 3 queries each time it runs. Notice how each query has the same unique run_id associated with it. The run_id can be anything; it just needs to be unique to each run. The example below uses Python's uuid module (str(uuid.uuid4())); a short Python sketch of generating and attaching it follows the queries.
-- FIRST QUERY RUN
select
c_birth_country,
count(distinct c_customer_id)
from snowflake_sample_data.tpcds_sf10tcl.customer
group by 1
order by 2 desc
/* {"select_schema_version": "1.0.0", "app": "app_name", "workload_id": "a_small_workload", "run_id": "3b80f989-95fb-4256-a918-83d66c5298d7", "environment": "prod", "team": "data-platform", "runner": "dagster", "priority": "high"} */
-- SECOND QUERY RUN
select
d_year,
d_moy,
sum(ws_net_profit)
from
snowflake_sample_data.tpcds_sf10tcl.web_sales
inner join snowflake_sample_data.tpcds_sf10tcl.date_dim
on web_sales.ws_sold_date_sk = date_dim.d_date_sk
where
d_year between 1995 and 2000
group by 1, 2
order by 1 desc, 2 asc
/* {"select_schema_version": "1.0.0", "app": "app_name", "workload_id": "a_small_workload", "run_id": "3b80f989-95fb-4256-a918-83d66c5298d7", "environment": "prod", "team": "data-platform", "runner": "dagster", "priority": "high"} */
-- THIRD QUERY RUN
select 3
/* {"select_schema_version": "1.0.0", "app": "app_name", "workload_id": "a_small_workload", "run_id": "3b80f989-95fb-4256-a918-83d66c5298d7", "environment": "prod", "team": "data-platform", "runner": "dagster", "priority": "high"} */
In the SELECT custom workloads UI, you'll be able to see all queries associated with each workload run, and track run duration and cost over time (as opposed to individual query duration and cost over time).
Additional Metadata
When coming up with your tagging structure, think about logical ways you may want to analyze your workloads. For example, if your pipeline has a retry mechanism built in, you may want to add an is_retry field to your metadata so you can analyze how much retries are costing you across all workloads.
Other helpful fields you may want to add include:
- tenant_id: e.g. 'customer_a', 'customer_b'
- environment: e.g. 'dev', 'staging', 'prod', etc.
- department: e.g. 'advanced analytics', 'support', 'core', etc.
- team: e.g. 'data_engineering', 'analytics_engineering', 'solutions_architecture', etc.
- owner_email: e.g. '[email protected]'
- domain: e.g. 'marketing', 'growth'
All of these fields can be exposed as filters in the SELECT UI, allowing you to slice and dice accordingly. Consider a workload which has a query metadata field called runner, which tracks the source system that emitted the query. You can see cost and performance data split by the runner metadata field directly in the UI.
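For illustration, here's a minimal sketch of how a pipeline might populate fields like runner and is_retry when building its query comment (all field values below are hypothetical):
import json
# Hypothetical values supplied by the pipeline's run context
query_comment = json.dumps({
    "select_schema_version": "1.0.0",
    "app": "data_transformations",
    "workload_id": "customers",
    "environment": "prod",
    "team": "data_engineering",
    "tenant_id": "customer_a",
    "runner": "dagster",   # source system that emitted the query
    "is_retry": False,     # lets you analyze what retries cost
})
# Append this to each query before executing it, as in the earlier Python example
query_comment_str = f"\n\n/* {query_comment} */"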
See the section below on "Leveraging custom metadata fields" for more information.
Leveraging custom metadata fields
Custom workloads allow users to tag their queries with comments or Snowflake query tags. Here’s an example query comment SELECT has for one of its custom workloads:
select ...
from queries
group by all
order by sum_query_cost desc nulls last
limit 1000
/* {
"select_schema_version": "1.0.0",
"app": "SELECT_WEB_APP",
"workload_id": "DailyWorkloadPoPQuery",
"environment": "prod",
"tenant_id": "scwxhob-ad38017",
"referrer_base_url": "https://select.dev/app/scwxhob-ad38017/workloads",
"is_admin_user": true,
"user_id": "google-oauth2|115381118870474481840"
}*/
Custom workload fields like app, workload_id and environment are already available for filtering & slicing in the UI out of the box. To enable the same behaviour for other fields you specify in your custom workload query comment/tag, head to Settings → Custom and specify the keys you want enabled.
Once added, you can filter and slice by this metadata throughout the UI!
Leverage custom meta keys in your Usage Groups
Looking to do cost chargebacks or showbacks using the metadata you've added to your query? You can now flexibly allocate costs based on custom workload metadata using SELECT's Usage Groups feature!
We don't currently support keys with complex data types
We currently do not support filtering/slicing on custom meta keys which contain complex data types like arrays or objects. If this is a requirement for you, please let us know and we will add this functionality.
select * from credit_cards
/* {
"select_schema_version": "1.0.0",
"app": "generated_tables",
"workload_id": "my_etl_job",
"tags": ["critical", "sensitive"],
"meta": {"task_key": "A1SF249X", "attempt_number": 1} }
*/
Note, you can still add this data to your query comment/tag and perform your own analysis on it. You just won't be able to interact with that data in the SELECT UI as of today.