ClickHouse

ClickHouse Glossary

ClickHouse Glossary: Key terms for engines, joins & real-time data

Written by Meryem Cebeci06/06/2025, 10.26
hero about image

ClickHouse Glossary

All definitions in this glossary apply specifically to ClickHouse. Terms may be used differently in other systems.


Core Concepts

Cluster

A Cluster in ClickHouse is a group of interconnected nodes (servers) that work together to distribute and process data across multiple machines.

Granule

A Granule in ClickHouse is a block of rows—typically 8192—that is processed as a single unit during query execution.

Part

A part in ClickHouse refers to a physical unit of data storage on disk, created during data insertion and merged over time.

Projection

A Projection is a pre-aggregated part of a table stored within the base table, optimized for faster query execution.

Replica

A Replica is an exact copy of a table's data on another node within the same shard, ensuring availability and consistency.

Sampling

Sampling allows queries to read only a subset of data, improving performance for large datasets.

Shard

A Shard is a horizontal data partition across nodes. Each shard contains a unique data segment and may have replicas.

Sparse Index

A Sparse Index stores one entry per granule, enabling ClickHouse to skip over large irrelevant data blocks during queries.


Table Engines

AggregatingMergeTree

Stores pre-aggregated data, optimized for read performance in analytics.

Buffer

Temporarily holds data in memory and flushes it to disk-based tables, improving insert speed.

CollapsingMergeTree

Deduplicates events using a sign column—ideal for audit logs.

Distributed

A virtual table that distributes queries across nodes in a cluster.

File

Reads and writes flat file formats (CSV, JSON, etc.). Good for import/export, not production.

GraphiteMergeTree

Time-series optimized engine supporting aggregation and retention for monitoring setups.

HDFS

Enables reading from and writing to Hadoop Distributed File System (HDFS).

JDBC

Connects ClickHouse to external databases via JDBC for federated queries.

Join (Table Engine)

In-memory join engine for efficient joining with small reference tables. Can be fed by tools like GlassFlow.

Kafka

Reads real-time messages from Kafka topics. Often used with materialized views and GlassFlow.

Log

Simple, append-only engine with no indexing—mainly for testing or debugging.

MergeTree

The foundational table engine supporting indexing, partitions, and background merges.

Null

Accepts data but doesn’t store it—used for testing or intentionally discarding data.

ODBC

Connects to external databases using the ODBC interface.

ReplacingMergeTree

Deduplicates rows by primary key, optionally using a version column. Great in streaming use cases with GlassFlow.

S3

Reads/writes from Amazon S3 or compatible storage, ideal for cloud-based data.

StripeLog

Append-only storage engine for fast inserts, suitable for lightweight analytics.

SummingMergeTree

Aggregates numeric values on merge—ideal for counters, views, etc. Can be combined with GlassFlow.

TinyLog

Minimalist engine storing all data in one file, used for quick tests or small data.

VersionedCollapsingMergeTree

Like CollapsingMergeTree, but includes version control to retain the most relevant row.


Views & Virtual Tables

Dictionary View

Lets you query dictionaries as if they were tables—useful for debugging or ad hoc joins.

Live View

Continuously updates as new data arrives. Useful for monitoring and live dashboards.

Materialized View

Automatically stores query results on insert. Often used with Kafka pipelines or tools like GlassFlow.

View

A reusable SELECT query. Doesn’t store data—executes each time on access.


Join Types

ALL LEFT JOIN

Returns all left table rows, including duplicates from the right table. Includes unmatched rows as NULLs.

ANTI JOIN

Returns only left table rows without matches in the right table.

ANY INNER JOIN

Returns one arbitrary match per left row—even if more matches exist.

ANY LEFT JOIN

Same as above but includes unmatched left-side rows with NULLs.

ARRAY JOIN

Expands array values in a column into multiple rows.

ASOF JOIN

Joins time-based or ordered data using "closest earlier match" logic.

CROSS JOIN

Cartesian product of two tables—combine every row with every row.

DIRECT JOIN

Efficient join using in-memory tables created with the Join engine.

FULL OUTER JOIN

Returns all rows from both sides, matched and unmatched.

GLOBAL JOIN

Used in distributed setups—sends right-side data to all nodes.

INNER JOIN

Returns only rows with matches in both tables.

Join Key

The column(s) used to perform joins between tables.

LEFT JOIN

Returns all rows from the left table, matched or not.

RIGHT JOIN

Like LEFT JOIN but with the right table as base.

SEMI JOIN

Filters the left table to only rows that have a match—but doesn’t include columns from the right.


Indexing & Storage

Background Merges

Automatically combine smaller parts into larger ones to improve performance.

Compression Codecs

Compress data on disk to save space and improve read performance.

Data Skipping Index

Summarizes row blocks (e.g. min/max) to help skip irrelevant data during queries.

Granularity

Defines how many rows form a granule—affects filtering precision and performance.

Merge Process

Background process to combine parts, apply deduplication, and optimize storage.

MinMax Index

Stores minimum and maximum values per block to speed up filtering.

Part Size

The physical size of a part on disk—affects performance and merge behavior.

Partition

Logical segmentation of table data to isolate and manage it better.

Primary Key

Defines row order and affects performance; doesn’t enforce uniqueness.

Projections

Pre-aggregated or pre-sorted data subsets stored in the base table to speed up queries.

Sorting Key

Specifies how data is sorted within each part—important for performance.

TTL (Time to Live)

Automatically removes or moves data after a defined retention period.


Deduplication & FINAL Queries

Audit Log

A table tracking inserts, updates, or deletes over time—useful for traceability and compliance.

Deduplication

Removes duplicate rows—can be done in background merges or upstream with GlassFlow.

Deduplication Window

Limits how far back ClickHouse looks for duplicates—helps control merge cost.

FINAL Modifier

Forces query to read fully deduplicated data—slower but guarantees clean output.

is_deleted Flag

Soft deletion pattern—uses a boolean column instead of physical row removal.

Lightweight DELETE

Marks rows for deletion without immediately removing them.

Mutation

Applies updates or deletes to data in the background.

OPTIMIZE TABLE

Manually triggers merges and deduplication—used after heavy inserts.

Primary Key Deduplication

Deduplicates based on primary key—default behavior in ReplacingMergeTree.

Version Column

Defines which row version to keep during deduplication merges.


Performance & Query Optimization

Aggregating States

Partially precomputed aggregates stored for later combination in distributed queries.

Aggregation Combinators

Function modifiers like sumIf() or avgMerge() that add logic to aggregations.

Distributed Query

A query that runs in parallel across multiple cluster nodes.

Hash Join

Loads one table into memory for fast lookups during a join.

Join Algorithms

Includes Hash Join, Merge Join, Partial Merge Join—each with different performance tradeoffs.

Merge Join

Efficient for sorted tables—walks both sides in order.

Partial Merge Join

Handles partially sorted data with hybrid merge logic.

Memory Limits

Prevents queries from exceeding memory thresholds.

Parallel Processing

Splits queries into tasks that run on multiple CPU cores simultaneously.

Query Execution

The full lifecycle of processing a SQL query—filters, joins, aggregations, and result output.

Query Profiler

Tool to analyze performance hotspots in a query.

Read / Write Buffers

Temporary memory for incoming or outgoing data—tuning helps performance.

system.parts

System table showing all parts on disk—useful for debugging and monitoring.

system.query_log

Logs every executed query—great for performance audits.


Data Ingestion & Streaming

async_insert

Returns immediately after an insert, writing data in the background.

CDC (Change Data Capture)

Captures real-time changes from databases for streaming pipelines using tools like Debezium or GlassFlow.

ClickHouse Sink Connector

Kafka Connect tool for writing data from Kafka into ClickHouse.

ClickPipes

Defines streaming pipelines inside ClickHouse using SQL-like syntax.

Debezium

Reads change logs from databases and streams events via Kafka.

Dictionaries

External data used to enrich queries without joins—loaded in memory.

GlassFlow

Open-source streaming tool for deduplication and real-time joins before data enters ClickHouse.

Insert Quorum

Ensures data is written to multiple replicas before confirming success.

Kafka Connect

Ecosystem for syncing Kafka with external systems—used in ClickHouse pipelines.

Stream Processing

Processing data in real time instead of batches—often done with GlassFlow, Flink, or Kafka Streams.

Upsert Semantics

Keeps the latest version of a row, overwriting old ones based on primary key and version.


Development & Tools

Array

Stores multiple values in one column—useful for tags or grouped data.

ClickHouse Cloud

Fully managed, scalable version of ClickHouse.

ClickHouse Keeper

Built-in alternative to ZooKeeper for distributed coordination.

ClickHouse Local

Run ClickHouse queries without starting a server—great for local testing.

Denormalization

Stores related data in one table to avoid JOINs—common in analytics.

LowCardinality

Efficiently stores columns with repeated values—reduces size and speeds up queries.

Map Support

Stores key-value pairs in one column—ideal for semi-structured data.

MergeTree Settings

Customizes how MergeTree behaves—merge frequency, part size, etc.

Nested Data

Groups related fields in one column—works with arrays and structures.

SQL Extensions

Extra SQL capabilities like array functions, special JOINs, and analytics features.

Tuple Support

Groups multiple values in a single column, e.g. (user_id, score).

Window Functions

Calculates values over related rows (running totals, rankings, etc.).

ZooKeeper

External coordination tool (older setups)—often replaced by ClickHouse Keeper.

Did you like this article? Share it now!

You might also like

Try it now

Cleaned Kafka Streams for ClickHouse

Clean Data. No maintenance. Less load for ClickHouse.

GitHub Repo