Snowflake 101: Working with Semi-Structured Data
- Tomáš Sobotík, Lead Data Engineer at Tietoevry
Semi-structured data has become a pillar of many data-related deliveries. It provides more flexibility compared to structured data. Such flexibility is often needed by many workloads, especially those built around various APIs, IoT sensors, or web apps. JSON has become the de facto “standard” for communication between APIs. Not being structured like traditional relational data brings challenges in how to process and work with such data structures. In the following blog posts, I would like to go through the Snowflake data platform capabilities related to processing semi-structured data. Let’s cover the whole data lifecycle, starting with ingestion, through processing, and up to serving the data to consumers.
What is Semi-Structured Data?
Semi-structured data is a type of data that does not follow the tabular structure of relational data. It can be thought of as a hybrid between structured and unstructured data formats. It has some level of organization but allows much greater flexibility in how the data is stored and used.
Elements of semi-structured data can include:
- Key-Value Pairs: the most basic form of semi-structured data, essentially a simple data structure where each item is made up of a key and a corresponding value.
- Hierarchical Structures: semi-structured data can be organized into nested elements, providing a more complex hierarchical data structure.
- Multi-Valued Fields: semi-structured data can contain fields that have multiple values, commonly known as arrays.
Semi-Structured JSON Data Example
Let’s take a JSON document as an example of semi-structured data because the JSON format is easy for humans to read. There is no fixed format, and a JSON document can have whatever structure suits the needs of the use case. You can see a sample JSON document below. It contains a person element with several key-value pairs and one array.
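Here is a minimal illustrative sample (the field names and values are made up for this post):
{
  "person": {
    "first_name": "Jane",
    "last_name": "Doe",
    "birth_date": "1985-04-12",
    "hobbies": ["hiking", "reading", "photography"]
  }
}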
Snowflake's Native Support for Semi-Structured Data Formats
How does Snowflake work with semi-structured data? Quite well, I would say. There is native support for semi-structured data in Snowflake. This means you can natively import data from and export data to the following data formats:
- JSON
- Avro
- ORC
- Parquet
- XML
In terms of storing semi-structured data in Snowflake, there are the following native data types: ARRAY, OBJECT, and VARIANT.
It's also worth mentioning that you can work with semi-structured data in external tables without loading it into Snowflake. You can keep your JSON or Parquet files in your external Data Lake and still use the data in Snowflake!
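A minimal sketch of what that can look like, assuming a stage named my_lake_stage pointing at your data lake (an external table exposes the raw file contents through a VARIANT column named VALUE):
-- External table over Parquet files that stay in the data lake
CREATE EXTERNAL TABLE customers_ext
  LOCATION = @my_lake_stage/customers/
  FILE_FORMAT = (TYPE = PARQUET)
  AUTO_REFRESH = FALSE;

-- The raw file content is available in the VALUE column
SELECT value:first_name::STRING AS first_name
FROM customers_ext;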
Loading Semi-Structured Data into Snowflake
For a refresher on data loading in Snowflake, be sure to check out our previous post which covers all the different options. Below, I cover two different options for semi-structured data loading specifically.
Option 1: Load the Whole File into a Single Column
In terms of loading raw semi-structured data into Snowflake tables, you have two options. Either store the whole file/document in a single column, or flatten the data and store individual values per column. It depends on your use case and how you want to work with the data later. I usually load the data in its raw format and store it in a VARIANT column, and then flatten it, because that simplifies future data processing. Just be aware that a single VARIANT value is limited to 16 MB of compressed data. If you imagine a JSON file with 16 MB of compressed data, it is a pretty big file. 🙂
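A minimal sketch of that first option, assuming a stage named my_stage holding a person.json file:
-- Target table with a single VARIANT column
CREATE TABLE raw_persons (v VARIANT);

-- Load the whole JSON document into that column
COPY INTO raw_persons
FROM @my_stage/person.json
FILE_FORMAT = (TYPE = 'JSON');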
Option 2: Generate a Schema Based on the File Structure
The previous approach works for JSONs, but semi-structured data can also be stored in formats other than JSON. Many times, you can see data being stored in Apache Parquet format, especially in Data Lakes. Unlike JSON, which is a text-based format, Parquet is a binary format optimized for columnar storage. This means that data is stored in columns rather than rows, making it more efficient for querying and analyzing large datasets.
One of the main advantages of Parquet is its ability to handle very large datasets efficiently. Since data is stored in columns, it can be compressed and stored in a more compact form, reducing the amount of storage required and improving the performance of read operations.
Another difference between JSON and Parquet is how they handle data types. JSON is a loosely typed format, meaning it does not enforce strict data types for each field. Parquet, on the other hand, has a well-defined schema and supports various data types, including integers, floating-point numbers, strings, and timestamps. This makes it easier to work with data consistently and ensures that data is stored and processed accurately.
Such differences require different approaches when it comes to data ingestion. As Parquet has a well-defined schema, you need to define it during the data import. If you’ve ever tried to load Parquet data into Snowflake, you might have faced a time-consuming process of defining the file schema. You also need to create a target table with the exact same schema. If the file has tens or hundreds of columns, you might spend hours on this task.
As a refresher for those of you who have not loaded Parquet data into Snowflake before, check out the following code. You can see that as part of the COPY command, you have to list the attributes from the file together with their data types.
COPY INTO <table_name>
FROM (
SELECT
$1:column1::<target_data_type>,
$1:column2::<target_data_type>,
$1:column3::<target_data_type>
FROM @<my_stage>/<my_file>.parquet
);
$1 in the SELECT query refers to the single column where all the Parquet data is stored.
Before you can load the data into Snowflake, you need to first create a target table with the required schema and then write the COPY statement listing all the columns, as per the example above.
Considerations for storing semi-structured data in a single column vs. flattening the nested structure
You may now be wondering, what are some considerations for when to use each option?
If you're unsure of the future use of the data, storing it in a single VARIANT column is a good starting point. According to the Snowflake documentation, you will get similar query performance compared to a flattened structure if the semi-structured format uses native types like strings and integers. Storage costs will also be similar.
If you are working with dates, timestamps or arrays, Snowflake claims you will get better query pruning if you store them as separate columns.
Storing frequently accessed columns/fields separately improves the end-user experience as the data becomes more straightforward to work with, eliminating the need for users to know how to parse JSON data.
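For illustration, a view over the raw_persons table sketched earlier could expose the frequently used fields as typed columns (the field names simply follow the sample JSON above):
-- Flattened view over the raw VARIANT column
CREATE VIEW persons_flat AS
SELECT
    v:person:first_name::STRING AS first_name,
    v:person:last_name::STRING  AS last_name,
    v:person:birth_date::DATE   AS birth_date
FROM raw_persons;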
Automatic Schema Detection & Table Creation
Rather than requiring you to explicitly list out all the columns, Snowflake has automated this process. Now you can use a couple of built-in functions to help you read the schema from the file and prepare the COPY command. There is also a function to automate the table creation. I can confirm that this can save you hours of work because I have been working with very large Parquet files (100+ columns) where I had to do everything manually. Let’s look at how to use these functions and automate the whole process.
How to use INFER_SCHEMA
The INFER_SCHEMA function automatically detects the file metadata schema in staged files that contain semi-structured data and returns the column definitions. Currently, this function works for Apache Parquet, Apache Avro, and ORC files.
It is a table function, so it needs to be wrapped in the TABLE() keyword:
SELECT *
FROM TABLE(
INFER_SCHEMA(
LOCATION=>'@mystage',
FILE_FORMAT=>'my_parquet_format'
)
);
The output then looks like this:
| COLUMN_NAME | TYPE | NULLABLE | EXPRESSION | FILENAMES | ORDER_ID |
|---|---|---|---|---|---|
| id | TEXT | True | $1:id::TEXT | customer/initial.parquet | 0 |
| first_name | TEXT | True | $1:first_name::TEXT | customer/initial.parquet | 1 |
| LAST_NAME | TEXT | True | $1:LAST_NAME::TEXT | customer/initial.parquet | 2 |
Please be aware that by default, the column names detected from staged files are treated as case-sensitive. This can be handled by the IGNORE_CASE => TRUE parameter.
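For instance, the same INFER_SCHEMA call with case-insensitive column detection could look like this:
SELECT *
FROM TABLE(
  INFER_SCHEMA(
    LOCATION=>'@mystage',
    FILE_FORMAT=>'my_parquet_format',
    IGNORE_CASE=>TRUE
  )
);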
Now we know the structure of the staged files, but how do we move on and create a table based on that schema? We have another function for that.
How to use GENERATE_COLUMN_DESCRIPTION
This function accepts as input the output of the INFER_SCHEMA function and generates a list of columns. We can use that output in a CREATE TABLE statement. The function has two parameters; the second one defines the formatting of the output, depending on whether we are going to create a table, view, or external table. Here is a combined example with INFER_SCHEMA:
SELECT GENERATE_COLUMN_DESCRIPTION(
ARRAY_AGG(OBJECT_CONSTRUCT(*)), 'table') AS COLUMNS
FROM TABLE (
INFER_SCHEMA(
LOCATION=>'@mystage',
FILE_FORMAT=>'my_parquet_format'
)
);
And the output:
+--------------------+
| COLUMNS            |
|--------------------|
| "id" TEXT,         |
| "first_name" TEXT, |
| "LAST_NAME" TEXT   |
+--------------------+
Now you can use the generated column list in your CREATE TABLE statement.
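Pasting the generated column list into a CREATE TABLE statement could then look roughly like this (the table name is just a placeholder):
CREATE TABLE mytable (
    "id" TEXT,
    "first_name" TEXT,
    "LAST_NAME" TEXT
);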
How to automatically create the table based on the detected schema
Using the GENERATE_COLUMN_DESCRIPTION function works great, but if you want to automate things even further and create the table automatically based on the detected schema, you can do so thanks to the USING TEMPLATE keyword in the CREATE TABLE statement. This keyword accepts as input the output of the INFER_SCHEMA function.
CREATE TABLE mytable
USING TEMPLATE (
SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
FROM TABLE(
INFER_SCHEMA(
LOCATION=>'@mystage',
FILE_FORMAT=>'my_parquet_format'
)
));
Pretty cool, isn’t it? By combining several functions, we are able to automate the initial phase of semi-structured data ingestion. If you want to go further, you can even encapsulate this logic into a Python stored procedure which would accept parameters like file format, stage, and target.
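As a rough sketch of what such a procedure could look like, here is a minimal version built on the Snowpark Python runtime; the procedure name and parameters are illustrative, and a production version would need validation and error handling:
CREATE OR REPLACE PROCEDURE create_table_from_stage(
    target_table STRING,
    stage_location STRING,
    file_format STRING
)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = '3.10'
PACKAGES = ('snowflake-snowpark-python')
HANDLER = 'run'
AS
$$
def run(session, target_table, stage_location, file_format):
    # Build a CREATE TABLE ... USING TEMPLATE statement from the inferred schema
    ddl = f"""
        CREATE TABLE {target_table}
        USING TEMPLATE (
            SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
            FROM TABLE(
                INFER_SCHEMA(
                    LOCATION => '{stage_location}',
                    FILE_FORMAT => '{file_format}'
                )
            )
        )
    """
    session.sql(ddl).collect()
    return f"Created {target_table} from files in {stage_location}"
$$;

-- Example call
CALL create_table_from_stage('mytable', '@mystage', 'my_parquet_format');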
Next Steps
Now that you have your semi-structured data loaded into your Snowflake account, you will want to be able to work with it. Be sure to check out our next post on working with JSON data.