A new JSON data type for ClickHouse
New JSON Type & Performance
- New JSON type and related internal types (Dynamic, Variant) aim to store JSON in a columnar, typed way instead of as plain strings.
- Key benefits discussed: automatic column “explosion” from JSON, reduced CPU/IO versus repeatedly parsing strings, and support for mixed/dynamic schemas without predeclaring subtypes.
- Multiple commenters note similar ideas in Snowflake (VARIANT), SingleStore, Scratchdata/DuckDB, Oracle 23ai, and possibly Google Capacitor.
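The column “explosion” idea above can be sketched in plain Python: each JSON path becomes its own column of values, and the set of types observed per path is tracked separately (the “dynamic” part). This is an illustrative toy, not ClickHouse's actual implementation.

```python
import json

def explode(rows):
    """Flatten JSON documents into per-path columns, recording the
    set of value types seen at each path."""
    columns = {}  # path -> list of values, one slot per input row
    types = {}    # path -> set of Python type names observed

    def walk(obj, prefix, out):
        if isinstance(obj, dict):
            for k, v in obj.items():
                walk(v, f"{prefix}.{k}" if prefix else k, out)
        else:
            out[prefix] = obj

    for i, raw in enumerate(rows):
        flat = {}
        walk(json.loads(raw), "", flat)
        for path, value in flat.items():
            # Backfill Nones for rows where this path was absent.
            columns.setdefault(path, [None] * i).append(value)
            types.setdefault(path, set()).add(type(value).__name__)
        for path in columns:
            if len(columns[path]) < i + 1:
                columns[path].append(None)
    return columns, types

cols, typs = explode(['{"user": {"id": 1}}',
                      '{"user": {"id": "abc"}, "ts": 5}'])
# cols["user.id"] is [1, "abc"]; typs["user.id"] is {"int", "str"}
```

Note how `user.id` carries two types across rows: this mixed case is exactly what the Dynamic/Variant machinery is meant to store without a predeclared schema.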
What “JSON” Means Here
- Some are confused or skeptical about calling this “JSON,” arguing it’s really an internal storage format, not a new JSON standard.
- Others clarify: it still ingests/outputs standard JSON; the “new type” is the binary/columnar representation of JSON values, analogous to PostgreSQL/SQLite’s JSONB.
- Distinction is made between JSON as an encoding vs JSON as a data type (numbers, strings, arrays, objects).
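That encoding-vs-data-type distinction is easy to see in a few lines of Python: the encoding is the text form, while the data type is the typed value tree a database (or JSONB-style store) can keep natively instead of re-parsing strings.

```python
import json

# JSON as an encoding: a text serialization.
text = '{"count": 3, "tags": ["a", "b"], "ratio": 0.5}'

# JSON as a data type: typed values (numbers, strings, arrays, objects).
value = json.loads(text)
assert isinstance(value["count"], int)
assert isinstance(value["tags"], list)

# A binary/columnar representation stores the typed structure; round-tripping
# yields equivalent JSON, not necessarily byte-identical text.
assert json.loads(json.dumps(value)) == value
```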
ClickHouse vs Other Databases
- Widely praised as “criminally underused,” especially once Postgres hits OLAP limits (ad‑hoc aggregates, count distinct, heavy writes, 100s of GB+).
- Compared to DuckDB: DuckDB often faster per core and great for single-node analytics; ClickHouse seen as better once you need clusters, replication, and long-term growth.
- Pinot and Doris are suggested alternatives, with Pinot touted for very low-latency “user-facing analytics” and star-tree indexes; there is disagreement over whether ClickHouse was originally single-node or cluster-first.
Scalability & Operational Traits
- Users report success at scales from hundreds of millions to 100B+ rows with minimal maintenance and good compression.
- ClickHouse is described as resource-hungry and ill-suited to tightly shared environments (e.g., big nodes inside Kubernetes pods).
- Some concern about difficulty redistributing data when adding new nodes; remedies are unclear or rely on backup/restore tools.
Ingestion & “Real-Time”
- Several people struggle with “reliable ingestion” and are put off by Kafka/ZooKeeper complexity.
- Others report success via:
- Direct HTTP/driver inserts (often batched).
- Asynchronous inserts and ReplacingMergeTree to handle duplicates.
- Buffers/ETL tools like Vector, Fluentd, RudderStack, Tinybird, Fivetran, S3 + refreshable materialized views, or Kafka-compatible systems like Redpanda.
- Debate over what counts as “real-time”: strict per-event, exactly-once vs micro-batched, near-real-time analytics.
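The “direct inserts, often batched” approach can be sketched as a small client-side buffer that accumulates rows and flushes them as a JSONEachRow payload (one JSON object per line, the format ClickHouse's HTTP interface accepts for inserts). The flush callback and table are hypothetical; in practice it would POST to the server with a query like `INSERT INTO events FORMAT JSONEachRow`.

```python
import json

class BatchBuffer:
    """Minimal micro-batching sketch: collect rows, flush every
    max_rows as one newline-delimited JSON payload."""

    def __init__(self, flush_fn, max_rows=1000):
        self.flush_fn = flush_fn  # e.g., a function that POSTs the payload
        self.max_rows = max_rows
        self.rows = []

    def add(self, row: dict):
        self.rows.append(row)
        if len(self.rows) >= self.max_rows:
            self.flush()

    def flush(self):
        if not self.rows:
            return
        # JSONEachRow: one JSON object per line.
        payload = "\n".join(json.dumps(r) for r in self.rows)
        self.flush_fn(payload)
        self.rows = []

batches = []
buf = BatchBuffer(batches.append, max_rows=2)
for i in range(5):
    buf.add({"id": i})
buf.flush()  # flush the trailing partial batch
# batches now holds three payloads: [0,1], [2,3], [4]
```

Batching like this is what keeps per-insert overhead low; asynchronous inserts move the same buffering server-side, and ReplacingMergeTree absorbs the duplicates a retry can produce.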
JSON Typing & Elasticsearch Contrast
- New support for “dynamically changing data” types excites users frustrated with Elasticsearch’s “first-seen type wins” behavior.
- One proposal (in the Elasticsearch context) is type-suffixed field names (e.g., value::int, value::str), but others note this complicates queries and reporting.
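The type-suffixing workaround amounts to renaming each field by its value's type at ingest time, so documents with conflicting types land under different field names instead of being rejected. A hypothetical sketch:

```python
# Map Python value types to name suffixes (illustrative, not a real ES API).
SUFFIX = {int: "int", float: "float", str: "str", bool: "bool"}

def suffix_fields(doc: dict) -> dict:
    """Rename top-level fields as name::type based on the value's type."""
    out = {}
    for key, value in doc.items():
        suffix = SUFFIX.get(type(value))
        out[f"{key}::{suffix}" if suffix else key] = value
    return out

a = suffix_fields({"value": 42})    # {"value::int": 42}
b = suffix_fields({"value": "42"})  # {"value::str": "42"}
```

The downside raised in the thread is visible here: every query or report over `value` must now union `value::int` and `value::str`, which is exactly the bookkeeping a native dynamic JSON type is meant to remove.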