A new JSON data type for ClickHouse
New JSON Type & Performance
- New JSON type and related internal types (Dynamic, Variant) aim to store JSON in a columnar, typed way instead of as plain strings.
- Key benefits discussed: automatic column “explosion” from JSON, reduced CPU/IO versus repeatedly parsing strings, and support for mixed/dynamic schemas without predeclaring subtypes.
- Multiple commenters note similar ideas in Snowflake (VARIANT), SingleStore, Scratchdata/DuckDB, Oracle 23ai, and possibly Google Capacitor.
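The column “explosion” idea above can be sketched in plain Python: each JSON path becomes its own column of values, and the set of types observed per path is tracked separately (the “dynamic” part). This is an illustrative toy, not ClickHouse's actual implementation.

```python
import json

def explode(rows):
    """Flatten JSON documents into per-path columns, recording the
    set of value types seen at each path."""
    columns = {}  # path -> list of values, one slot per input row
    types = {}    # path -> set of Python type names observed

    def walk(obj, prefix, out):
        if isinstance(obj, dict):
            for k, v in obj.items():
                walk(v, f"{prefix}.{k}" if prefix else k, out)
        else:
            out[prefix] = obj

    for i, raw in enumerate(rows):
        flat = {}
        walk(json.loads(raw), "", flat)
        for path, value in flat.items():
            # Backfill Nones for rows where this path was absent.
            columns.setdefault(path, [None] * i).append(value)
            types.setdefault(path, set()).add(type(value).__name__)
        for path in columns:
            if len(columns[path]) < i + 1:
                columns[path].append(None)
    return columns, types

cols, typs = explode(['{"user": {"id": 1}}',
                      '{"user": {"id": "abc"}, "ts": 5}'])
# cols["user.id"] is [1, "abc"]; typs["user.id"] is {"int", "str"}
```

Note how `user.id` carries two types across rows: this mixed case is exactly what the Dynamic/Variant machinery is meant to store without a predeclared schema.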
What “JSON” Means Here
- Some are confused or skeptical about calling this “JSON,” arguing it’s really an internal storage format, not a new JSON standard.
- Others clarify: it still ingests/outputs standard JSON; the “new type” is the binary/columnar representation of JSON values, analogous to PostgreSQL/SQLite’s JSONB.
- Distinction is made between JSON as an encoding vs JSON as a data type (numbers, strings, arrays, objects).
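That encoding-vs-data-type distinction is easy to see in a few lines of Python: the encoding is the text form, while the data type is the typed value tree a database (or JSONB-style store) can keep natively instead of re-parsing strings.

```python
import json

# JSON as an encoding: a text serialization.
text = '{"count": 3, "tags": ["a", "b"], "ratio": 0.5}'

# JSON as a data type: typed values (numbers, strings, arrays, objects).
value = json.loads(text)
assert isinstance(value["count"], int)
assert isinstance(value["tags"], list)

# A binary/columnar representation stores the typed structure; round-tripping
# yields equivalent JSON, not necessarily byte-identical text.
assert json.loads(json.dumps(value)) == value
```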
ClickHouse vs Other Databases
- Widely praised as “criminally underused,” especially once Postgres hits OLAP limits (ad‑hoc aggregates, count distinct, heavy writes, 100s of GB+).
- Compared to DuckDB: DuckDB often faster per core and great for single-node analytics; ClickHouse seen as better once you need clusters, replication, and long-term growth.
- Pinot and Doris are suggested alternatives, with Pinot touted for very low-latency “user-facing analytics” and star-tree indexes; there is disagreement over whether ClickHouse was originally single-node or cluster-first.
Scalability & Operational Traits
- Users report success at scales from hundreds of millions to 100B+ rows with minimal maintenance and good compression.
- ClickHouse is described as resource-hungry and ill-suited to tightly shared environments (e.g., big nodes inside Kubernetes pods).
- Some concern about difficulty redistributing data when adding new nodes; remedies are unclear or rely on backup/restore tools.
Ingestion & “Real-Time”
- Several people struggle with “reliable ingestion” and are put off by Kafka/ZooKeeper complexity.
- Others report success via:
- Direct HTTP/driver inserts (often batched).
- Asynchronous inserts and ReplacingMergeTree to handle duplicates.
- Buffers/ETL tools like Vector, Fluentd, RudderStack, Tinybird, Fivetran, S3 + refreshable materialized views, or Kafka-compatible systems like Redpanda.
- Debate over what counts as “real-time”: strict per-event, exactly-once vs micro-batched, near-real-time analytics.
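The “direct inserts, often batched” approach can be sketched as a small client-side buffer that accumulates rows and flushes them as a JSONEachRow payload (one JSON object per line, the format ClickHouse's HTTP interface accepts for inserts). The flush callback and table are hypothetical; in practice it would POST to the server with a query like `INSERT INTO events FORMAT JSONEachRow`.

```python
import json

class BatchBuffer:
    """Minimal micro-batching sketch: collect rows, flush every
    max_rows as one newline-delimited JSON payload."""

    def __init__(self, flush_fn, max_rows=1000):
        self.flush_fn = flush_fn  # e.g., a function that POSTs the payload
        self.max_rows = max_rows
        self.rows = []

    def add(self, row: dict):
        self.rows.append(row)
        if len(self.rows) >= self.max_rows:
            self.flush()

    def flush(self):
        if not self.rows:
            return
        # JSONEachRow: one JSON object per line.
        payload = "\n".join(json.dumps(r) for r in self.rows)
        self.flush_fn(payload)
        self.rows = []

batches = []
buf = BatchBuffer(batches.append, max_rows=2)
for i in range(5):
    buf.add({"id": i})
buf.flush()  # flush the trailing partial batch
# batches now holds three payloads: [0,1], [2,3], [4]
```

Batching like this is what keeps per-insert overhead low; asynchronous inserts move the same buffering server-side, and ReplacingMergeTree absorbs the duplicates a retry can produce.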
JSON Typing & Elasticsearch Contrast
- New support for “dynamically changing data” types excites users frustrated with Elasticsearch’s “first-seen type wins” behavior.
- One proposal (in the Elasticsearch context) is type-suffixed field names (e.g., value::int, value::str), but others note this complicates queries and reporting.
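The type-suffixing workaround amounts to renaming each field by its value's type at ingest time, so documents with conflicting types land under different field names instead of being rejected. A hypothetical sketch:

```python
# Map Python value types to name suffixes (illustrative, not a real ES API).
SUFFIX = {int: "int", float: "float", str: "str", bool: "bool"}

def suffix_fields(doc: dict) -> dict:
    """Rename top-level fields as name::type based on the value's type."""
    out = {}
    for key, value in doc.items():
        suffix = SUFFIX.get(type(value))
        out[f"{key}::{suffix}" if suffix else key] = value
    return out

a = suffix_fields({"value": 42})    # {"value::int": 42}
b = suffix_fields({"value": "42"})  # {"value::str": "42"}
```

The downside raised in the thread is visible here: every query or report over `value` must now union `value::int` and `value::str`, which is exactly the bookkeeping a native dynamic JSON type is meant to remove.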