ClickHouse Glossary
All definitions in this glossary apply specifically to ClickHouse. Terms may be used differently in other systems.
Core Concepts
Cluster
A Cluster in ClickHouse is a group of interconnected nodes (servers) that work together to distribute and process data across multiple machines.
Granule
A Granule in ClickHouse is a block of rows—typically 8192—that is processed as a single unit during query execution.
Part
A part in ClickHouse refers to a physical unit of data storage on disk, created during data insertion and merged over time.
Projection
A Projection is a pre-aggregated part of a table stored within the base table, optimized for faster query execution.
Replica
A Replica is an exact copy of a table's data on another node within the same shard, ensuring availability and consistency.
Sampling
Sampling allows queries to read only a subset of data, improving performance for large datasets.
Shard
A Shard is a horizontal data partition across nodes. Each shard contains a unique data segment and may have replicas.
Sparse Index
A Sparse Index stores one entry per granule, enabling ClickHouse to skip over large irrelevant data blocks during queries.
Table Engines
AggregatingMergeTree
Stores pre-aggregated data, optimized for read performance in analytics.
Buffer
Temporarily holds data in memory and flushes it to disk-based tables, improving insert speed.
CollapsingMergeTree
Deduplicates events using a sign column—ideal for audit logs.
Distributed
A virtual table that distributes queries across nodes in a cluster.
File
Reads and writes flat file formats (CSV, JSON, etc.). Good for import/export, not production.
GraphiteMergeTree
Time-series optimized engine supporting aggregation and retention for monitoring setups.
HDFS
Enables reading from and writing to Hadoop Distributed File System (HDFS).
JDBC
Connects ClickHouse to external databases via JDBC for federated queries.
Join (Table Engine)
In-memory join engine for efficient joining with small reference tables. Can be fed by tools like GlassFlow.
Kafka
Reads real-time messages from Kafka topics. Often used with materialized views and GlassFlow.
Log
Simple, append-only engine with no indexing—mainly for testing or debugging.
MergeTree
The foundational table engine supporting indexing, partitions, and background merges.
Null
Accepts data but doesn’t store it—used for testing or intentionally discarding data.
ODBC
Connects to external databases using the ODBC interface.
ReplacingMergeTree
Deduplicates rows by primary key, optionally using a version column. Great in streaming use cases with GlassFlow.
S3
Reads/writes from Amazon S3 or compatible storage, ideal for cloud-based data.
StripeLog
Append-only storage engine for fast inserts, suitable for lightweight analytics.
SummingMergeTree
Aggregates numeric values on merge—ideal for counters, views, etc. Can be combined with GlassFlow.
TinyLog
Minimalist engine storing all data in one file, used for quick tests or small data.
VersionedCollapsingMergeTree
Like CollapsingMergeTree, but includes version control to retain the most relevant row.
Views & Virtual Tables
Dictionary View
Lets you query dictionaries as if they were tables—useful for debugging or ad hoc joins.
Live View
Continuously updates as new data arrives. Useful for monitoring and live dashboards.
Materialized View
Automatically stores query results on insert. Often used with Kafka pipelines or tools like GlassFlow.
View
A reusable SELECT query. Doesn’t store data—executes each time on access.
Join Types
ALL LEFT JOIN
Returns all left table rows, including duplicates from the right table. Includes unmatched rows as NULLs.
ANTI JOIN
Returns only left table rows without matches in the right table.
ANY INNER JOIN
Returns one arbitrary match per left row—even if more matches exist.
ANY LEFT JOIN
Same as above but includes unmatched left-side rows with NULLs.
ARRAY JOIN
Expands array values in a column into multiple rows.
ASOF JOIN
Joins time-based or ordered data using "closest earlier match" logic.
CROSS JOIN
Cartesian product of two tables—combine every row with every row.
DIRECT JOIN
Efficient join using in-memory tables created with the Join engine.
FULL OUTER JOIN
Returns all rows from both sides, matched and unmatched.
GLOBAL JOIN
Used in distributed setups—sends right-side data to all nodes.
INNER JOIN
Returns only rows with matches in both tables.
Join Key
The column(s) used to perform joins between tables.
LEFT JOIN
Returns all rows from the left table, matched or not.
RIGHT JOIN
Like LEFT JOIN but with the right table as base.
SEMI JOIN
Filters the left table to only rows that have a match—but doesn’t include columns from the right.
Indexing & Storage
Background Merges
Automatically combine smaller parts into larger ones to improve performance.
Compression Codecs
Compress data on disk to save space and improve read performance.
Data Skipping Index
Summarizes row blocks (e.g. min/max) to help skip irrelevant data during queries.
Granularity
Defines how many rows form a granule—affects filtering precision and performance.
Merge Process
Background process to combine parts, apply deduplication, and optimize storage.
MinMax Index
Stores minimum and maximum values per block to speed up filtering.
Part Size
The physical size of a part on disk—affects performance and merge behavior.
Partition
Logical segmentation of table data to isolate and manage it better.
Primary Key
Defines row order and affects performance; doesn’t enforce uniqueness.
Projections
Pre-aggregated or pre-sorted data subsets stored in the base table to speed up queries.
Sorting Key
Specifies how data is sorted within each part—important for performance.
TTL (Time to Live)
Automatically removes or moves data after a defined retention period.
Deduplication & FINAL Queries
Audit Log
A table tracking inserts, updates, or deletes over time—useful for traceability and compliance.
Deduplication
Removes duplicate rows—can be done in background merges or upstream with GlassFlow.
Deduplication Window
Limits how far back ClickHouse looks for duplicates—helps control merge cost.
FINAL Modifier
Forces query to read fully deduplicated data—slower but guarantees clean output.
is_deleted Flag
Soft deletion pattern—uses a boolean column instead of physical row removal.
Lightweight DELETE
Marks rows for deletion without immediately removing them.
Mutation
Applies updates or deletes to data in the background.
OPTIMIZE TABLE
Manually triggers merges and deduplication—used after heavy inserts.
Primary Key Deduplication
Deduplicates based on primary key—default behavior in ReplacingMergeTree.
Version Column
Defines which row version to keep during deduplication merges.
Performance & Query Optimization
Aggregating States
Partially precomputed aggregates stored for later combination in distributed queries.
Aggregation Combinators
Function modifiers like sumIf()
or avgMerge()
that add logic to aggregations.
Distributed Query
A query that runs in parallel across multiple cluster nodes.
Hash Join
Loads one table into memory for fast lookups during a join.
Join Algorithms
Includes Hash Join, Merge Join, Partial Merge Join—each with different performance tradeoffs.
Merge Join
Efficient for sorted tables—walks both sides in order.
Partial Merge Join
Handles partially sorted data with hybrid merge logic.
Memory Limits
Prevents queries from exceeding memory thresholds.
Parallel Processing
Splits queries into tasks that run on multiple CPU cores simultaneously.
Query Execution
The full lifecycle of processing a SQL query—filters, joins, aggregations, and result output.
Query Profiler
Tool to analyze performance hotspots in a query.
Read / Write Buffers
Temporary memory for incoming or outgoing data—tuning helps performance.
system.parts
System table showing all parts on disk—useful for debugging and monitoring.
system.query_log
Logs every executed query—great for performance audits.
Data Ingestion & Streaming
async_insert
Returns immediately after an insert, writing data in the background.
CDC (Change Data Capture)
Captures real-time changes from databases for streaming pipelines using tools like Debezium or GlassFlow.
ClickHouse Sink Connector
Kafka Connect tool for writing data from Kafka into ClickHouse.
ClickPipes
Defines streaming pipelines inside ClickHouse using SQL-like syntax.
Debezium
Reads change logs from databases and streams events via Kafka.
Dictionaries
External data used to enrich queries without joins—loaded in memory.
GlassFlow
Open-source streaming tool for deduplication and real-time joins before data enters ClickHouse.
Insert Quorum
Ensures data is written to multiple replicas before confirming success.
Kafka Connect
Ecosystem for syncing Kafka with external systems—used in ClickHouse pipelines.
Stream Processing
Processing data in real time instead of batches—often done with GlassFlow, Flink, or Kafka Streams.
Upsert Semantics
Keeps the latest version of a row, overwriting old ones based on primary key and version.
Development & Tools
Array
Stores multiple values in one column—useful for tags or grouped data.
ClickHouse Cloud
Fully managed, scalable version of ClickHouse.
ClickHouse Keeper
Built-in alternative to ZooKeeper for distributed coordination.
ClickHouse Local
Run ClickHouse queries without starting a server—great for local testing.
Denormalization
Stores related data in one table to avoid JOINs—common in analytics.
LowCardinality
Efficiently stores columns with repeated values—reduces size and speeds up queries.
Map Support
Stores key-value pairs in one column—ideal for semi-structured data.
MergeTree Settings
Customizes how MergeTree behaves—merge frequency, part size, etc.
Nested Data
Groups related fields in one column—works with arrays and structures.
SQL Extensions
Extra SQL capabilities like array functions, special JOINs, and analytics features.
Tuple Support
Groups multiple values in a single column, e.g. (user_id, score)
.
Window Functions
Calculates values over related rows (running totals, rankings, etc.).
ZooKeeper
External coordination tool (older setups)—often replaced by ClickHouse Keeper.