ClickHouse table graph tool
In this article, I will describe my tool clickhouse-table-graph for ClickHouse table dependencies visualization.
Introduction
I like ClickHouse a lot.
If you like ClickHouse a lot you probably like create fast and reliable data pipelines by using different kinds of ClickHouse tables and materialized views. At some point of time, you may have a lot of tables and materialized views. So it could be hard to understand the dependencies between them.
Also, I like visualization and diagrams, so I thought it would be nice to have a tool that can visualize ClickHouse table dependencies. One of my favorite tools for creating diagrams is mermaid. So, why not to use this tool to visualize ClickHouse table dependencies?
Clickhouse table graph tool overview
The initial idea was to create a Go library to build a graph of ClickHouse tables and output this graph as a mermaid diagram. But with Go it is really simple to create a CLI tool, so I decided to create a CLI tool as well.
This tool is my first Go project and I found it is fascinating to work with Go. I like the simplicity of the language and the performance of the compiled code. But probably I write Go code in Java/Kotlin style.
List of CLI options and usage you can find in README of the project. In this article I will provide some examples of how to use this tool. But before that, couple of words about how it works.
How it works
Clickhouse-table-graph uses ClickHouse system tables to get information about tables and materialized views.
So you need to have access to the ClickHouse database and SELECT grants for system.tables
table.
system.tables
table contains columns dependencies_database
, dependencies_table
, create_table_query
, engine
were from all required information about table dependencies can be retrieved.
In general, algorithm is the following:
-
Get list of tables and materialized views from
system.tables
table -
Build set of links between tables based on the list of tables and materialized views retrieved from the
system.tables
engine
columns is used to determine the table type- if engine is
Disributed
then dependencies are extracted from the distributed engine parameters - if engine is
MaterializedView
then dependencies are extracted from thecreate_table_query
column - for all other cases dependencies are extracted from the
dependencies_database
anddependencies_table
columns
-
Build tables graph for the desired table based on the set of links between tables
- Starting from the desired table, algorithm goes through all linked tables and put them into an array which represents the graph
- The depth-first search algorithm (a bit modified) is used to traverse the graph and find linked tables
-
This depth-first search algorithm is bit modified because we need only tables which are linked to the desired table by direct link.
For example, there are 2 links in total. If we want to display graph for
table_A
the graph should contain all dependent tables fortable_A
:table_B
andtable_C
:flowchart TB table_A --> table_B table_A --> table_C style table_A stroke:yellow
But if we want to display graph for
table_B
the graph should contain onlytable_A
, because data fromtable_B
is not connected with data fromtable_C
, it is like an independent branch. So there is no sense to displaytable_C
in the chart to keep it simple and clean:flowchart TB table_A --> table_B style table_B stroke:yellow
-
-
Output links from the graph array as a mermaid diagram
flowchart TB table_A --> table_B table_A --> table_C
Examples
Prepare ClickHouse tables and materialized views
Now, lets create some clickhouse tables and materialized views and visualize dependencies between them with clickhouse-table-graph tool.
I’m going to create the following structure of tables to demonstrate how the tool will display the graph depending on the desired table:
flowchart TB input_table_A --> materialized_view_from_A_to_B materialized_view_from_A_to_B --> table_B table_B --> distributed_table_B input_table_A --> materialized_view_from_A_to_C materialized_view_from_A_to_C --> table_C table_C --> distributed_table_C input_table_D --> materialized_view_from_D_to_C materialized_view_from_D_to_C --> table_C
The following DDL queries will create tables and materialized views as on the diagram above:
CREATE DATABASE IF NOT EXISTS tg;
-- create tables
-- input_table_A
CREATE TABLE IF NOT EXISTS tg.input_table_A
(
id UInt64,
name String
) ENGINE = Null;
-- input_table_D
CREATE TABLE IF NOT EXISTS tg.input_table_D
(
id UInt64,
name String
) ENGINE = Null;
-- table_B
CREATE TABLE IF NOT EXISTS tg.table_B
(
id UInt64,
name String
) ENGINE = MergeTree()
ORDER BY (id);
-- table_C
CREATE TABLE IF NOT EXISTS tg.table_C
(
id UInt64,
name String
) ENGINE = MergeTree()
ORDER BY (id);
-- create materialized views
-- materialized_view_from_A_to_B
CREATE MATERIALIZED VIEW IF NOT EXISTS tg.materialized_view_from_A_to_B TO tg.table_B AS
SELECT
input_table_A.id as id,
input_table_A.name as name
FROM tg.input_table_A;
-- materialized_view_from_A_to_C
CREATE MATERIALIZED VIEW IF NOT EXISTS tg.materialized_view_from_A_to_C TO tg.table_C AS
SELECT
input_table_A.id as id,
input_table_A.name as name
FROM tg.input_table_A;
-- materialized_view_from_D_to_C
CREATE MATERIALIZED VIEW IF NOT EXISTS tg.materialized_view_from_D_to_C TO tg.table_C AS
SELECT
input_table_D.id as id,
input_table_D.name as name
FROM tg.input_table_D;
-- create distributed table
-- distributed_table_B
CREATE TABLE IF NOT EXISTS tg.distributed_table_B AS tg.table_B
ENGINE = Distributed(your_cluster, tg, table_B);
-- distributed_table_C
CREATE TABLE IF NOT EXISTS tg.distributed_table_C AS tg.table_C
ENGINE = Distributed(your_cluster, tg, table_C);
Please note that you need to replace
your_cluster
with the name of your ClickHouse cluster defined in the clickhouseremote_servers
config. Or just skip the creation of the distributed tables, it is not so important for the demonstration.
Now we cun run the clickhouse-table-graph tool to visualize the dependencies between tables.
Please note, that before running the tool you need to check out the repository and build the binary. You can find the instructions in the README.
The following steps assume that you have already built the binary
chtg-cli
and added it to the PATH environment variable on your system.
Example 1
For example, to visualize dependencies for tg.input_table_A
execute the following command:
chtg-cli --clickhouse-table tg.input_table_A --out-format mermaid-md --table-highlight-color yellow
- The tool will connect to
localhost:9000
if--clickhouse-host
and--clickhouse-port
are not provided. - With the
--table-highlight-color yellow
option the desired table will be highlighted with yellow color. - Use
--out-format
to specify out format.mermaid-html
will give you the entire html page which can be opened in a web-browser.mermaid-md
will give you only markdown of mermaid diagram which can be used in the markdown files. - Use
--out-file
to store the graph to the specified file instead of console output.
After you input clickhouse password, the console output will be the following:
flowchart TB
%%{init: {'theme':''}}%%
tg.input_table_A@{ shape: rounded, label: "tg.input_table_A (Null)" } --> tg.materialized_view_from_A_to_C@{ shape: hex, label: "tg.materialized_view_from_A_to_C (MaterializedView)" }
tg.input_table_A@{ shape: rounded, label: "tg.input_table_A (Null)" } --> tg.materialized_view_from_A_to_B@{ shape: hex, label: "tg.materialized_view_from_A_to_B (MaterializedView)" }
tg.materialized_view_from_A_to_B@{ shape: hex, label: "tg.materialized_view_from_A_to_B (MaterializedView)" } --> tg.table_B@{ shape: rect, label: "tg.table_B (MergeTree)" }
tg.table_B@{ shape: rect, label: "tg.table_B (MergeTree)" } --> tg.distributed_table_B@{ shape: st-rect, label: "tg.distributed_table_B (Distributed)" }
tg.materialized_view_from_A_to_C@{ shape: hex, label: "tg.materialized_view_from_A_to_C (MaterializedView)" } --> tg.table_C@{ shape: rect, label: "tg.table_C (MergeTree)" }
tg.table_C@{ shape: rect, label: "tg.table_C (MergeTree)" } --> tg.distributed_table_C@{ shape: st-rect, label: "tg.distributed_table_C (Distributed)" }
tg.materialized_view_from_D_to_C@{ shape: hex, label: "tg.materialized_view_from_D_to_C (MaterializedView)" } --> tg.table_C@{ shape: rect, label: "tg.table_C (MergeTree)" }
tg.input_table_D@{ shape: rounded, label: "tg.input_table_D (Null)" } --> tg.materialized_view_from_D_to_C@{ shape: hex, label: "tg.materialized_view_from_D_to_C (MaterializedView)" }
style tg.input_table_A stroke:yellow
And the above mermaid diagram will be rendered like this. You can test it by yourself and copy-paste the console output to mermaid live editor
Example 2
Now let’s try to visualize dependencies for tg.input_table_D
:
chtg-cli --clickhouse-table tg.input_table_D --out-format mermaid-md --table-highlight-color yellow
In the output we will see fewer tables because the branch of input_table_A
to table_B
is not displayed. This branch is not connected with input_table_D
so no sense to display it.
flowchart TB
%%{init: {'theme':''}}%%
tg.input_table_D@{ shape: rounded, label: "tg.input_table_D (Null)" } --> tg.materialized_view_from_D_to_C@{ shape: hex, label: "tg.materialized_view_from_D_to_C (MaterializedView)" }
tg.materialized_view_from_D_to_C@{ shape: hex, label: "tg.materialized_view_from_D_to_C (MaterializedView)" } --> tg.table_C@{ shape: rect, label: "tg.table_C (MergeTree)" }
tg.table_C@{ shape: rect, label: "tg.table_C (MergeTree)" } --> tg.distributed_table_C@{ shape: st-rect, label: "tg.distributed_table_C (Distributed)" }
tg.materialized_view_from_A_to_C@{ shape: hex, label: "tg.materialized_view_from_A_to_C (MaterializedView)" } --> tg.table_C@{ shape: rect, label: "tg.table_C (MergeTree)" }
tg.input_table_A@{ shape: rounded, label: "tg.input_table_A (Null)" } --> tg.materialized_view_from_A_to_C@{ shape: hex, label: "tg.materialized_view_from_A_to_C (MaterializedView)" }
style tg.input_table_D stroke:yellow
Example 3
You can build graph for any table in the hierarchy. For example, let’s visualize dependencies for tg.table_C
:
chtg-cli --clickhouse-table tg.table_C --out-format mermaid-md --table-highlight-color yellow
In the output we will see all inbound and outbound dependencies for tg.table_C
:
flowchart TB
%%{init: {'theme':''}}%%
tg.table_C@{ shape: rect, label: "tg.table_C (MergeTree)" } --> tg.distributed_table_C@{ shape: st-rect, label: "tg.distributed_table_C (Distributed)" }
tg.materialized_view_from_A_to_C@{ shape: hex, label: "tg.materialized_view_from_A_to_C (MaterializedView)" } --> tg.table_C@{ shape: rect, label: "tg.table_C (MergeTree)" }
tg.materialized_view_from_D_to_C@{ shape: hex, label: "tg.materialized_view_from_D_to_C (MaterializedView)" } --> tg.table_C@{ shape: rect, label: "tg.table_C (MergeTree)" }
tg.input_table_D@{ shape: rounded, label: "tg.input_table_D (Null)" } --> tg.materialized_view_from_D_to_C@{ shape: hex, label: "tg.materialized_view_from_D_to_C (MaterializedView)" }
tg.input_table_A@{ shape: rounded, label: "tg.input_table_A (Null)" } --> tg.materialized_view_from_A_to_C@{ shape: hex, label: "tg.materialized_view_from_A_to_C (MaterializedView)" }
style tg.table_C stroke:yellow
Example 4
Visualization also works for materialized views and distributed tables. Let’s visualize dependencies for materialized view tg.materialized_view_from_A_to_C
:
chtg-cli --clickhouse-table tg.materialized_view_from_A_to_C --out-format mermaid-md --table-highlight-color yellow
flowchart TB
%%{init: {'theme':''}}%%
tg.materialized_view_from_A_to_C@{ shape: hex, label: "tg.materialized_view_from_A_to_C (MaterializedView)" } --> tg.table_C@{ shape: rect, label: "tg.table_C (MergeTree)" }
tg.input_table_A@{ shape: rounded, label: "tg.input_table_A (Null)" } --> tg.materialized_view_from_A_to_C@{ shape: hex, label: "tg.materialized_view_from_A_to_C (MaterializedView)" }
tg.table_C@{ shape: rect, label: "tg.table_C (MergeTree)" } --> tg.distributed_table_C@{ shape: st-rect, label: "tg.distributed_table_C (Distributed)" }
tg.materialized_view_from_D_to_C@{ shape: hex, label: "tg.materialized_view_from_D_to_C (MaterializedView)" } --> tg.table_C@{ shape: rect, label: "tg.table_C (MergeTree)" }
tg.input_table_D@{ shape: rounded, label: "tg.input_table_D (Null)" } --> tg.materialized_view_from_D_to_C@{ shape: hex, label: "tg.materialized_view_from_D_to_C (MaterializedView)" }
style tg.materialized_view_from_A_to_C stroke:yellow
Conclusion
From my experience, visualisation is very handy when you work with complex data pipelines. It helps you to understand the data flow and dependencies between tables. Recently I use clickhouse-table-graph tool a lot. This tool helps me to spend less time to find the root cause of issues with data pipelines. It also helps me to explain the data flow to my colleagues and stakeholders in a simple and clear way without need to manually create diagrams and maintain it.
Thanks to mermaid, diagrams can be easily embedded into markdown files and viewed in the browser without any additional tools. For example GutHub supports mermaid diagrams in markdown files out of the box.
I hope you will find it useful as well.
Of course if you have any ideas how to improve the tool, please let me know or just create pool request on GitHub. I will be happy to hear your feedback.