ClickHouse table graph tool

ClickHouse table graph tool

January 28, 2025

In this article, I will describe my tool clickhouse-table-graph for ClickHouse table dependencies visualization.

Introduction

I like ClickHouse a lot.

If you like ClickHouse a lot you probably like create fast and reliable data pipelines by using different kinds of ClickHouse tables and materialized views. At some point of time, you may have a lot of tables and materialized views. So it could be hard to understand the dependencies between them.

Also, I like visualization and diagrams, so I thought it would be nice to have a tool that can visualize ClickHouse table dependencies. One of my favorite tools for creating diagrams is mermaid. So, why not to use this tool to visualize ClickHouse table dependencies?

Clickhouse table graph tool overview

The initial idea was to create a Go library to build a graph of ClickHouse tables and output this graph as a mermaid diagram. But with Go it is really simple to create a CLI tool, so I decided to create a CLI tool as well.

This tool is my first Go project and I found it is fascinating to work with Go. I like the simplicity of the language and the performance of the compiled code. But probably I write Go code in Java/Kotlin style.

List of CLI options and usage you can find in README of the project. In this article I will provide some examples of how to use this tool. But before that, couple of words about how it works.

How it works

Clickhouse-table-graph uses ClickHouse system tables to get information about tables and materialized views. So you need to have access to the ClickHouse database and SELECT grants for system.tables table. system.tables table contains columns dependencies_database, dependencies_table, create_table_query, engine were from all required information about table dependencies can be retrieved.

In general, algorithm is the following:

  1. Get list of tables and materialized views from system.tables table

  2. Build set of links between tables based on the list of tables and materialized views retrieved from the system.tables

    • engine columns is used to determine the table type
    • if engine is Disributed then dependencies are extracted from the distributed engine parameters
    • if engine is MaterializedView then dependencies are extracted from the create_table_query column
    • for all other cases dependencies are extracted from the dependencies_database and dependencies_table columns
  3. Build tables graph for the desired table based on the set of links between tables

    • Starting from the desired table, algorithm goes through all linked tables and put them into an array which represents the graph
    • The depth-first search algorithm (a bit modified) is used to traverse the graph and find linked tables
      • This depth-first search algorithm is bit modified because we need only tables which are linked to the desired table by direct link.

        For example, there are 2 links in total. If we want to display graph for table_A the graph should contain all dependent tables for table_A: table_B and table_C:

           flowchart TB
           table_A --> table_B
           table_A --> table_C
           style table_A stroke:yellow

        But if we want to display graph for table_B the graph should contain only table_A, because data from table_B is not connected with data from table_C, it is like an independent branch. So there is no sense to display table_C in the chart to keep it simple and clean:

           flowchart TB
           table_A --> table_B
           style table_B stroke:yellow
  4. Output links from the graph array as a mermaid diagram

    flowchart TB
    table_A --> table_B
    table_A --> table_C 

Examples

Prepare ClickHouse tables and materialized views

Now, lets create some clickhouse tables and materialized views and visualize dependencies between them with clickhouse-table-graph tool.

I’m going to create the following structure of tables to demonstrate how the tool will display the graph depending on the desired table:

flowchart TB
input_table_A --> materialized_view_from_A_to_B
materialized_view_from_A_to_B --> table_B
table_B --> distributed_table_B

input_table_A --> materialized_view_from_A_to_C
materialized_view_from_A_to_C --> table_C
table_C --> distributed_table_C

input_table_D --> materialized_view_from_D_to_C
materialized_view_from_D_to_C --> table_C

The following DDL queries will create tables and materialized views as on the diagram above:

CREATE DATABASE IF NOT EXISTS tg;

-- create tables
-- input_table_A
CREATE TABLE IF NOT EXISTS tg.input_table_A
(
    id UInt64,
    name String
) ENGINE = Null;

-- input_table_D
CREATE TABLE IF NOT EXISTS tg.input_table_D
(
    id UInt64,
    name String
) ENGINE = Null;

-- table_B
CREATE TABLE IF NOT EXISTS tg.table_B
(
    id UInt64,
    name String
) ENGINE = MergeTree()
ORDER BY (id);

-- table_C
CREATE TABLE IF NOT EXISTS tg.table_C
(
    id UInt64,
    name String
) ENGINE = MergeTree()
ORDER BY (id);

-- create materialized views
-- materialized_view_from_A_to_B
CREATE MATERIALIZED VIEW IF NOT EXISTS tg.materialized_view_from_A_to_B TO tg.table_B AS
SELECT
    input_table_A.id as id,
    input_table_A.name as name
FROM tg.input_table_A;

-- materialized_view_from_A_to_C
CREATE MATERIALIZED VIEW IF NOT EXISTS tg.materialized_view_from_A_to_C TO tg.table_C AS
SELECT
    input_table_A.id as id,
    input_table_A.name as name
FROM tg.input_table_A;

-- materialized_view_from_D_to_C
CREATE MATERIALIZED VIEW IF NOT EXISTS tg.materialized_view_from_D_to_C TO tg.table_C AS
SELECT
    input_table_D.id as id,
    input_table_D.name as name
FROM tg.input_table_D;

-- create distributed table
-- distributed_table_B
CREATE TABLE IF NOT EXISTS tg.distributed_table_B AS tg.table_B
ENGINE = Distributed(your_cluster, tg, table_B);

-- distributed_table_C
CREATE TABLE IF NOT EXISTS tg.distributed_table_C AS tg.table_C
ENGINE = Distributed(your_cluster, tg, table_C);

Please note that you need to replace your_cluster with the name of your ClickHouse cluster defined in the clickhouse remote_servers config. Or just skip the creation of the distributed tables, it is not so important for the demonstration.

Now we cun run the clickhouse-table-graph tool to visualize the dependencies between tables.

Please note, that before running the tool you need to check out the repository and build the binary. You can find the instructions in the README.

The following steps assume that you have already built the binary chtg-cli and added it to the PATH environment variable on your system.

Example 1

For example, to visualize dependencies for tg.input_table_A execute the following command:

chtg-cli --clickhouse-table tg.input_table_A --out-format mermaid-md --table-highlight-color yellow
  • The tool will connect to localhost:9000 if --clickhouse-host and --clickhouse-port are not provided.
  • With the --table-highlight-color yellow option the desired table will be highlighted with yellow color.
  • Use --out-format to specify out format. mermaid-html will give you the entire html page which can be opened in a web-browser. mermaid-md will give you only markdown of mermaid diagram which can be used in the markdown files.
  • Use --out-file to store the graph to the specified file instead of console output.

After you input clickhouse password, the console output will be the following:

flowchart TB
%%{init: {'theme':''}}%%
tg.input_table_A@{ shape: rounded, label: "tg.input_table_A (Null)" } --> tg.materialized_view_from_A_to_C@{ shape: hex, label: "tg.materialized_view_from_A_to_C (MaterializedView)" }
tg.input_table_A@{ shape: rounded, label: "tg.input_table_A (Null)" } --> tg.materialized_view_from_A_to_B@{ shape: hex, label: "tg.materialized_view_from_A_to_B (MaterializedView)" }
tg.materialized_view_from_A_to_B@{ shape: hex, label: "tg.materialized_view_from_A_to_B (MaterializedView)" } --> tg.table_B@{ shape: rect, label: "tg.table_B (MergeTree)" }
tg.table_B@{ shape: rect, label: "tg.table_B (MergeTree)" } --> tg.distributed_table_B@{ shape: st-rect, label: "tg.distributed_table_B (Distributed)" }
tg.materialized_view_from_A_to_C@{ shape: hex, label: "tg.materialized_view_from_A_to_C (MaterializedView)" } --> tg.table_C@{ shape: rect, label: "tg.table_C (MergeTree)" }
tg.table_C@{ shape: rect, label: "tg.table_C (MergeTree)" } --> tg.distributed_table_C@{ shape: st-rect, label: "tg.distributed_table_C (Distributed)" }
tg.materialized_view_from_D_to_C@{ shape: hex, label: "tg.materialized_view_from_D_to_C (MaterializedView)" } --> tg.table_C@{ shape: rect, label: "tg.table_C (MergeTree)" }
tg.input_table_D@{ shape: rounded, label: "tg.input_table_D (Null)" } --> tg.materialized_view_from_D_to_C@{ shape: hex, label: "tg.materialized_view_from_D_to_C (MaterializedView)" }
style tg.input_table_A stroke:yellow

And the above mermaid diagram will be rendered like this. You can test it by yourself and copy-paste the console output to mermaid live editor

Example 2

Now let’s try to visualize dependencies for tg.input_table_D:

chtg-cli --clickhouse-table tg.input_table_D --out-format mermaid-md --table-highlight-color yellow

In the output we will see fewer tables because the branch of input_table_A to table_B is not displayed. This branch is not connected with input_table_D so no sense to display it.

flowchart TB
%%{init: {'theme':''}}%%
tg.input_table_D@{ shape: rounded, label: "tg.input_table_D (Null)" } --> tg.materialized_view_from_D_to_C@{ shape: hex, label: "tg.materialized_view_from_D_to_C (MaterializedView)" }
tg.materialized_view_from_D_to_C@{ shape: hex, label: "tg.materialized_view_from_D_to_C (MaterializedView)" } --> tg.table_C@{ shape: rect, label: "tg.table_C (MergeTree)" }
tg.table_C@{ shape: rect, label: "tg.table_C (MergeTree)" } --> tg.distributed_table_C@{ shape: st-rect, label: "tg.distributed_table_C (Distributed)" }
tg.materialized_view_from_A_to_C@{ shape: hex, label: "tg.materialized_view_from_A_to_C (MaterializedView)" } --> tg.table_C@{ shape: rect, label: "tg.table_C (MergeTree)" }
tg.input_table_A@{ shape: rounded, label: "tg.input_table_A (Null)" } --> tg.materialized_view_from_A_to_C@{ shape: hex, label: "tg.materialized_view_from_A_to_C (MaterializedView)" }
style tg.input_table_D stroke:yellow

Example 3

You can build graph for any table in the hierarchy. For example, let’s visualize dependencies for tg.table_C:

chtg-cli --clickhouse-table tg.table_C --out-format mermaid-md --table-highlight-color yellow

In the output we will see all inbound and outbound dependencies for tg.table_C:

flowchart TB
%%{init: {'theme':''}}%%
tg.table_C@{ shape: rect, label: "tg.table_C (MergeTree)" } --> tg.distributed_table_C@{ shape: st-rect, label: "tg.distributed_table_C (Distributed)" }
tg.materialized_view_from_A_to_C@{ shape: hex, label: "tg.materialized_view_from_A_to_C (MaterializedView)" } --> tg.table_C@{ shape: rect, label: "tg.table_C (MergeTree)" }
tg.materialized_view_from_D_to_C@{ shape: hex, label: "tg.materialized_view_from_D_to_C (MaterializedView)" } --> tg.table_C@{ shape: rect, label: "tg.table_C (MergeTree)" }
tg.input_table_D@{ shape: rounded, label: "tg.input_table_D (Null)" } --> tg.materialized_view_from_D_to_C@{ shape: hex, label: "tg.materialized_view_from_D_to_C (MaterializedView)" }
tg.input_table_A@{ shape: rounded, label: "tg.input_table_A (Null)" } --> tg.materialized_view_from_A_to_C@{ shape: hex, label: "tg.materialized_view_from_A_to_C (MaterializedView)" }
style tg.table_C stroke:yellow

Example 4

Visualization also works for materialized views and distributed tables. Let’s visualize dependencies for materialized view tg.materialized_view_from_A_to_C:

chtg-cli --clickhouse-table tg.materialized_view_from_A_to_C --out-format mermaid-md --table-highlight-color yellow
flowchart TB
%%{init: {'theme':''}}%%
tg.materialized_view_from_A_to_C@{ shape: hex, label: "tg.materialized_view_from_A_to_C (MaterializedView)" } --> tg.table_C@{ shape: rect, label: "tg.table_C (MergeTree)" }
tg.input_table_A@{ shape: rounded, label: "tg.input_table_A (Null)" } --> tg.materialized_view_from_A_to_C@{ shape: hex, label: "tg.materialized_view_from_A_to_C (MaterializedView)" }
tg.table_C@{ shape: rect, label: "tg.table_C (MergeTree)" } --> tg.distributed_table_C@{ shape: st-rect, label: "tg.distributed_table_C (Distributed)" }
tg.materialized_view_from_D_to_C@{ shape: hex, label: "tg.materialized_view_from_D_to_C (MaterializedView)" } --> tg.table_C@{ shape: rect, label: "tg.table_C (MergeTree)" }
tg.input_table_D@{ shape: rounded, label: "tg.input_table_D (Null)" } --> tg.materialized_view_from_D_to_C@{ shape: hex, label: "tg.materialized_view_from_D_to_C (MaterializedView)" }
style tg.materialized_view_from_A_to_C stroke:yellow

Conclusion

From my experience, visualisation is very handy when you work with complex data pipelines. It helps you to understand the data flow and dependencies between tables. Recently I use clickhouse-table-graph tool a lot. This tool helps me to spend less time to find the root cause of issues with data pipelines. It also helps me to explain the data flow to my colleagues and stakeholders in a simple and clear way without need to manually create diagrams and maintain it.

Thanks to mermaid, diagrams can be easily embedded into markdown files and viewed in the browser without any additional tools. For example GutHub supports mermaid diagrams in markdown files out of the box.

I hope you will find it useful as well.

Of course if you have any ideas how to improve the tool, please let me know or just create pool request on GitHub. I will be happy to hear your feedback.