Snowflake Performance Tuning: Clustering in Snowflake

Somnath Roy
8 min read · Mar 21, 2023


In a few RDBMSs, clustering of data is available in some form or another. It reduces the time needed to extract data. If similar data across different tables is placed in the same cluster, query performance improves immensely by reducing –

1. The number of partitions hit for joins.

2. The number of partitions hit for filters.

Snowflake also allows the user to define clustering for its tables. But there has to be a proper reason for doing so. Along with the benefits (which we will discuss later in this post), there are a few drawbacks –

1. Clustering involves compute cost, based on the clustering keys that we define. Once a data load is done, Snowflake performs a background task of rearranging the micro-partitions based on the insert order of the data and the columns defined as the clustering key for the table.

2. If a considerable amount of data changes (through INSERTs or UPDATEs) and the table is big, the re-clustering of the data will not happen immediately. So a query executed within a few seconds against such a table will not get the benefit of the clustering. Re-clustering takes effect after some time, depending on how much data was changed and how big the table is.
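If you want to see what this background re-clustering actually costs, Snowflake exposes it through the AUTOMATIC_CLUSTERING_HISTORY table function in INFORMATION_SCHEMA (there is also a corresponding ACCOUNT_USAGE view). A minimal sketch, assuming the clustered table DEMO.PUBLIC.PART_CLUSTERED that we create later in this post:

-- Credits consumed by background re-clustering over the last 7 days
SELECT START_TIME,
       END_TIME,
       CREDITS_USED,
       NUM_BYTES_RECLUSTERED
FROM TABLE(INFORMATION_SCHEMA.AUTOMATIC_CLUSTERING_HISTORY(
       DATE_RANGE_START => DATEADD('day', -7, CURRENT_TIMESTAMP()),
       TABLE_NAME => 'DEMO.PUBLIC.PART_CLUSTERED'));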

What are Micro-Partitions?

There are a lot of documents available on the internet about this. I will use the definition from the Snowflake documentation: “All data in Snowflake tables is automatically divided into micro-partitions, which are contiguous units of storage. Each micro-partition contains between 50 MB and 500 MB of uncompressed data (note that the actual size in Snowflake is smaller because data is always stored compressed). Groups of rows in tables are mapped into individual micro-partitions, organized in a columnar fashion. This size and structure allows for extremely granular pruning of very large tables, which can be comprised of millions, or even hundreds of millions, of micro-partitions.”

So if the data is stored in micro-partitions, queries will benefit when the same kind of data is stored together, because fewer partitions will need to be scanned. Before we move further, we need to understand a few terms in depth. I will take a query to explain each of them.

Sample Query –

SELECT * FROM <<table1>> WHERE <<col1>> = 'STRING1' AND <<col2>> = 12345;

Ideally this query would be optimal if it could scan only a few partitions rather than all of them. If clustering is missing, the plan will look like the one below.

Let’s first understand the “Profile Overview”.

Total Execution Time — the total execution time of the query. The execution time is further subdivided into the following components:

1. Processing — time spent on data processing by the CPU.

2. Local Disk IO — time when the processing was blocked by local disk access.

3. Remote Disk IO — time when the processing was blocked by remote disk access. It is not present in the above snapshot; we will discuss it in detail in upcoming blogs. Following is an example.

4. Network Communication — time when the processing was waiting for the network data transfer.

5. Synchronization — various synchronization activities between participating processes.

6. Initialization — time spent setting up the query processing.

In the “Statistics” part, we are mostly interested in the following for this discussion –

1. Partitions scanned — number of partitions scanned so far.

2. Partitions total — total number of partitions in a given table.
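These two numbers are visible in the query profile, but you can also pull them programmatically. Here is a minimal sketch, assuming access to the SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY view (which lags real time by a while):

-- Recent queries against a table and how effectively they pruned partitions
SELECT QUERY_ID,
       QUERY_TEXT,
       PARTITIONS_SCANNED,
       PARTITIONS_TOTAL
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE QUERY_TEXT ILIKE '%<<table1>>%'
  AND START_TIME >= DATEADD('day', -1, CURRENT_TIMESTAMP())
ORDER BY START_TIME DESC;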

In our case Partitions scanned and Partitions total are the same. If this is happening for a lot of filtered queries hitting the same table, it means Snowflake is first bringing the entire table into memory and then applying the filter. This is not optimal if the table is big. In that case we have 2 options –

a) Create Clusters

b) INSERT the data after ordering it — This is effective if the filtering columns are not repeated across data loads. For example, ORDER_FACT may contain ORDER_ID, ORDER_DATE, ITEM_ID etc. If the filtering column is ORDER_DATE, then in a normal scenario the ORDER_DATE will not repeat every day. But where the filtering column can repeat in subsequent loads, this will not work. If we consider ORDER_FACT again with ITEM_ID as the filtering column, this may not work because ITEM_ID can repeat in subsequent data loads (see the sketch after this list).
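As a quick illustration of option (b), here is a minimal sketch of an ordered daily load into the ORDER_FACT table mentioned above (the staging table STG_ORDERS is a hypothetical name, not from the original example):

-- Hypothetical incremental load: rows are written already ordered by the filter column
INSERT INTO ORDER_FACT (ORDER_ID, ORDER_DATE, ITEM_ID)
SELECT ORDER_ID, ORDER_DATE, ITEM_ID
FROM STG_ORDERS          -- assumed staging table holding the day's orders
ORDER BY ORDER_DATE;     -- effective because ORDER_DATE does not repeat across loads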

Clusters

From a Snowflake article: “Snowflake supports automating these tasks by designating one or more table columns/expressions as a clustering key for the table. A table with a clustering key defined is considered to be clustered.” For the clustering to work optimally, we have to put the same data for the clustering column into the same micro-partition. To do that we have to select a clustering key (or keys). Based on the content of the clustering keys, the data will be distributed across the micro-partitions. The biggest challenge is deciding which columns should be part of the clustering key. To make the best guess we need to go through a 2-step exercise –

a) Analyze — Analyze the queries which are not performing well. Identify the join conditions and filter conditions; essentially, you have to model the query activity to get this information. Once that is done, check how the data would be partitioned over the candidate clustering keys.

b) Validate — Snowflake provides the function SYSTEM$CLUSTERING_INFORMATION to estimate the data distribution. The syntax of the function is as follows –

SELECT SYSTEM$CLUSTERING_INFORMATION('<<table name>>', '(<<column name>>)');

For example — SELECT SYSTEM$CLUSTERING_INFORMATION('TABLE1', '(COLUMN1, COLUMN2)');

Following is an example of a bad clustering key –

{
  "cluster_by_keys" : "LINEAR(COLUMN1,COLUMN2)",
  "total_partition_count" : 342,  ## Total number of micro-partitions that comprise the table.
  "total_constant_partition_count" : 0,
  "average_overlaps" : 341.0,
  "average_depth" : 342.0,
  "partition_depth_histogram" : {
    "00000" : 0,
    "00001" : 0,
    "00002" : 0,
    "00003" : 0,
    "00004" : 0,
    "00005" : 0,
    "00006" : 0,
    "00007" : 0,
    "00008" : 0,
    "00009" : 0,
    "00010" : 0,
    "00011" : 0,
    "00012" : 0,
    "00013" : 0,
    "00014" : 0,
    "00015" : 0,
    "00016" : 0,
    "00512" : 342
  }
}

For our discussion we will concentrate on 4 attributes –

· total_constant_partition_count : Total number of micro-partitions for which the value of the specified columns have reached a constant state (i.e. the micro-partitions will not benefit significantly from re-clustering). The number of constant micro-partitions in a table has an impact on pruning for queries. The higher the number, the more micro-partitions can be pruned from queries executed on the table, which has a corresponding impact on performance.

· average_overlaps : Average number of overlapping micro-partitions for each micro-partition in the table. A high number indicates the table is not well-clustered.

· average_depth : Average overlap depth of each micro-partition in the table. A high number indicates the table is not well-clustered.

· partition_depth_histogram : Distribution of overlap depth for each micro-partition in the table.

A relatively better clustering would be the following

{
  "cluster_by_keys" : "LINEAR(COLUMN1,COLUMN2)",
  "total_partition_count" : 350,
  "total_constant_partition_count" : 30,
  "average_overlaps" : 4.3886,
  "average_depth" : 3.3371,
  "partition_depth_histogram" : {
    "00000" : 0,
    "00001" : 30,
    "00002" : 23,
    "00003" : 149,
    "00004" : 109,
    "00005" : 25,
    "00006" : 14,
    "00007" : 0,
    "00008" : 0,
    "00009" : 0,
    "00010" : 0,
    "00011" : 0,
    "00012" : 0,
    "00013" : 0,
    "00014" : 0,
    "00015" : 0,
    "00016" : 0
  }
}

This clustering is better because average_overlaps and average_depth are lower than in the first output. Based on this clustering, we will now check the coverage we get after creating the clusters for different filter conditions.

The next step is to check the depth when applying a FILTER condition. For that purpose the SYSTEM$CLUSTERING_DEPTH function is used.

Example — SELECT SYSTEM$CLUSTERING_DEPTH('TABLE1', '(COLUMN1, COLUMN2)', 'COLUMN2 = 50');

SELECT SYSTEM$CLUSTERING_DEPTH('TABLE1', '(COLUMN1, COLUMN2)', 'COLUMN1 = ''ADCD'' AND COLUMN2 = 50') AS CLUSTER_DEPTH;

We will check 3 scenarios — 1. Normal table creation with no cluster, 2. Table with a cluster, and 3. Table with no cluster and an ordered INSERT statement.

1. Normal table creation with no cluster

Now let's create a table without a cluster and apply a filter condition on it, to check what type of query plan is generated. We will copy the data from the table "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1000"."PART" into a new table.

CREATE OR REPLACE TABLE PART (
  P_PARTKEY NUMBER(38,0),
  P_NAME VARCHAR(55),
  P_MFGR VARCHAR(25),
  P_BRAND VARCHAR(10),
  P_TYPE VARCHAR(25),
  P_SIZE NUMBER(38,0),
  P_CONTAINER VARCHAR(10),
  P_RETAILPRICE NUMBER(12,2),
  P_COMMENT VARCHAR(23)
);

We inserted the data as follows –

INSERT INTO DEMO.PUBLIC.PART
SELECT * FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1000"."PART";

Now we execute the following query –

SELECT * FROM DEMO.PUBLIC.PART WHERE P_TYPE = 'SMALL PLATED COPPER' AND P_SIZE = 22;

The query plan is as follows –

All the partitions were scanned and disk spilling happened as we see “Remote Disk IO”. Not a good query plan.

2. Table with Cluster

Now we cluster on P_TYPE, P_SIZE and apply the same filter condition. We take two query plans for the same query — one immediately after clustering the table and the other after some time.

In the table definition we just add the CLUSTER BY clause; everything else remains the same –

CLUSTER BY(P_TYPE,P_SIZE);
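Putting it together, the clustered table used in the query below (PART_CLUSTERED) would be created roughly like this (a sketch assuming the same column definitions and load as the PART table above):

CREATE OR REPLACE TABLE PART_CLUSTERED (
  P_PARTKEY NUMBER(38,0),
  P_NAME VARCHAR(55),
  P_MFGR VARCHAR(25),
  P_BRAND VARCHAR(10),
  P_TYPE VARCHAR(25),
  P_SIZE NUMBER(38,0),
  P_CONTAINER VARCHAR(10),
  P_RETAILPRICE NUMBER(12,2),
  P_COMMENT VARCHAR(23)
) CLUSTER BY (P_TYPE, P_SIZE);

INSERT INTO DEMO.PUBLIC.PART_CLUSTERED
SELECT * FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1000"."PART";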

Query –

SELECT * FROM DEMO.PUBLIC.PART_CLUSTERED
WHERE P_TYPE = 'SMALL PLATED COPPER' AND P_SIZE = 22;

The following query plan was generated immediately after the table was created. Look at Partitions scanned and Partitions total — they are the same. Snowflake had not yet finished rearranging the micro-partitions based on the cluster keys, which is why a full table scan was done.
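Rather than waiting blindly, the SYSTEM$CLUSTERING_INFORMATION function introduced earlier can be used to watch the re-clustering progress. A minimal sketch against the table used here:

-- average_overlaps and average_depth should drop as background re-clustering catches up
SELECT SYSTEM$CLUSTERING_INFORMATION('DEMO.PUBLIC.PART_CLUSTERED', '(P_TYPE, P_SIZE)');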

Now we execute the same query after 10 minutes. Following is the query plan for the same query.

3. Table with no cluster and an ordered INSERT statement

Now we delete the data from the table DEMO.PUBLIC.PART, insert the data with an ORDER BY clause on P_TYPE, P_SIZE, and generate the query plan for the same query.

INSERT INTO DEMO.PUBLIC.PART
SELECT *
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1000"."PART" ORDER BY P_TYPE, P_SIZE;
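As a sanity check (not part of the original run), the same SYSTEM$CLUSTERING_INFORMATION function can be pointed at the re-ordered but unclustered table to confirm that the ordered insert grouped the micro-partitions well:

SELECT SYSTEM$CLUSTERING_INFORMATION('DEMO.PUBLIC.PART', '(P_TYPE, P_SIZE)');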

The new query plans are as follows.

But this approach will not hold for tables that are incrementally loaded and whose clustering column values repeat across subsequent loads.

Conclusion: Consider the clustering option with care and only after sufficient analysis.

We will discuss more Performance Tuning tips in subsequent blogs.
