Dynamic TTL for ClickHouse tables
In this article, I will describe how to implement dynamic TTL for ClickHouse tables based on the value of a specific table column. This approach allows you to set different TTLs for different rows in the table. This can be useful when you need to store data in the table for different periods of time, depending on the value of a specific column.
In order to implement dynamic TTL, we will use Dictionaries and User Defined Functions (UDFs).
Why dynamic TTL is needed
There are many cases when it’s required to have dynamic TTL based on the value of a specific column. For example:
- You have a table with some tenants related data, some tenants want to store their data for 1 year, some for 3 years, and some for 5 years. This case become more interesting if some tenants are ready to pay for extra data retention. Also, it’s important for GDPR compliance.
- You have a table which contains data for different countries. Government regulations require you to store data for different periods of time depending on the country.
Data deletion in ClickHouse
ClickHouse provides the following features for data deletion:
- Table TTL or Column TTL. ClickHouse has the TTL feature. But it’s not possible to set TTL for a specific row, like in Cassandra, so all data in the table will have the same TTL.
- Delete mutations
ALTER TABLE table_name DELETE WHERE condition
. This is a manual way to delete data from the table based on the condition. Mutations requires a lof of resources and if there are many ongoing mutations in the same time it will lead to notable performance degradation. So it’s not suitable if you have many tables which need to have dynamic data retention. - Lightweight deletes.
DELETE FROM table_name WHERE condition
ClickHouse supports lightweight deletes, this mechanism does not delete data physically from the disk immediately, but marked as deleted. The marked data will be deleted during the regular background merge process. This feature is also useful for achieve dynamic data retention, but it’s require some external service which will periodically perform delete queries.
Solution overview
The main idea of the solution is to provide TTL value during row insertion based on the value of a specific column. To achieve this, we will use the following components:
- Dictionaries. Each dictionary will contain the mapping between the column value and the TTL value. For example, the dictionary for the tenant’s specific TTL will contain the tenant ids and TTL values, the dictionary for the countries will contain the country ids and TTL values.
- User Defined Functions (UDFs). UDFs will implement logic to get the TTL value from the dictionary based on the column value. Actually it’s possible to write this logic directly in the TTL expression, but since the logic can be complex and can be reused in multiple target tables, it’s better to use UDFs to encapsulate this logic.
- TTL expression. The table definition should contain the TTL expression. In this expression, we will call the UDF to get the TTL value based on the column value.
Dictionaries
Dictionaries will provide retention value for a specific column value. It can be tenantId, countryId or whatever you want.
Please note that ClickHouse does not allow us to query regular tables in TTL expression for performance reasons. So we need to use dictionaries to query TTL values.
Example of dictionary definitions for tenants and countries:
CREATE DICTIONARY tenant_ttl_dict
(
tenant_id UInt32,
ttl_days UInt32
)
PRIMARY KEY user_id
SOURCE(...)
LAYOUT(...)
LIFETIME(...)
CREATE DICTIONARY country_ttl_dict
(
country_id UInt32,
ttl_days UInt32
)
PRIMARY KEY country_id
SOURCE(...)
LAYOUT(...)
LIFETIME(...)
Important point here: dictionaries which are used in TTL expression should be loaded right after ClickHouse server start, before ClickHouse loads metadata for the target table. Otherwise, ClickHouse will not be able to start.
So the dictionaries should be created in the default
database or created with a xml configuration file, it will guarantee that the dictionaries will be loaded and initialized before the target table metadata.
With these dictionaries we can get ttl_days value for a specific tenant_id or country_id by using built-in ClickHouse functions:
dictGetOrNull('tenant_ttl_dict', 'ttl_days', tenant_id_column)
dictGetOrNull('country_ttl_dict', 'ttl_days', country_id_column)
-- or other variants of the dictGet* functions
User Defined Functions (UDFs)
User Defined Functions (UDFs) allow us to encapsulate complex logic, reuse it and simplify TTL expression. For instance, if we want to implement some hierarchical TTL logic:
- If the there is a tenant specific TTL value, use it.
- If there is no a tenant specific TTL value, use a country specific TTL value.
- If there is no a country specific TTL value, use default TTL value. We can define set of UDFs for each usecase:
CREATE FUNCTION get_tenant_ttl
AS (tenant_id_column, default_ttl_days) -> dictGetOrDefault(tenant_ttl_dict, 'ttl_days', tenant_id_column, default_ttl_days);
CREATE FUNCTION get_country_ttl
AS (country_id_column, default_ttl_days) -> dictGetOrDefault(country_ttl_dict, 'ttl_days', country_id_column, default_ttl_days);
CREATE FUNCTION get_ttl
AS (tenant_id_column, country_id_column, default_ttl_days) -> coalesce(
dictGetOrNull(tenant_ttl_dict, 'ttl_days', tenant_id_column),
dictGetOrNull(country_ttl_dict, 'ttl_days', country_id_column),
default_ttl_days
);
Table structure
Once we have dictionaries and UDFs, we can define table with a MATERIALIZED
column which will calculate and store value for TTL expression:
CREATE TABLE table_tenant_data
(
tenant_id UInt32,
country_id UInt32,
created_at DateTime,
data String,
retention_days UInt32 MATERIALIZED get_ttl(tenant_id, country_id, 365)
)
ENGINE MergeTree()
ORDER BY (tenant_id, country_id, created_at)
TTL created_at + INTERVAL retention_days DAY;
Please note that it’s not possible use the UDFs or dictGet*
functions directly in the TTL expression, so we need to use a materialized column to store the TTL value.
For example, ClickHouse will not allow the following query:
CREATE TABLE table_tenant_data
(
tenant_id UInt32,
country_id UInt32,
created_at DateTime,
data String
)
ENGINE MergeTree()
ORDER BY (tenant_id, country_id, created_at)
TTL created_at + INTERVAL get_ttl(tenant_id, country_id, 365) DAY;
DB::Exception: TTL expression cannot contain non-deterministic functions, but contains function dictGetOrDefault. (BAD_ARGUMENTS)
Tests and monitoring
There several system tables in ClickHouse which can be useful for checking the actual TTL values and state of the system:
system.dictionaries
system.dictionaries table contains information about all dictionaries in the system. It can be useful to check the state of the dictionaries and the last load time.
system.functions
system.functions table contains information about all UDFs in the system.
system.parts
The actual TTL value for a table can be checked by querying the system.parts table. There are 2 useful columns:
delete_ttl_info_min
- the minimum TTL value for the table.delete_ttl_info_max
- the maximum TTL value for the table.
Once you have inserted some data into the table, you can check the actual TTL values by querying these columns:
SELECT delete_ttl_info_min, delete_ttl_info_max FROM system.parts WHERE table='table_tenant_data';
Drawbacks
- Performance. Since UDFs and dictionaries are called for each row during insertion, it can have performance impact. On practice, it’s not so critical and depends on the UDF logic and dictionaries performance. In our current production environment it cost us about 30% of insert performance.
- TTL for existing data is not updated. If you change the TTL value in the dictionary, it will not affect the existing data in the table. The new TTL value will be applied only for new inserted rows. In order to update the TTL value for existing data, you need to perform MATERIALIZE TTL operations which is a very expensive operation.
- Complexity. The solution requires additional components like dictionaries and UDFs, which can be complex to maintain and support.
Conclusion
Despite the fact that ClickHouse does not have dynamic TTL mechanism out of the box, it’s possible to implement it using dictionaries. The approach described in this article allows you to set different TTLs for different rows in the table based on the value of a specific column. This approach is successfully used in production environment, although it brings some performance overhead and complexity, but it’s so high price for this functionality.