A new JSON data type for ClickHouse

New JSON Type & Performance

  • New JSON type and related internal types (Dynamic, Variant) aim to store JSON in a columnar, typed way instead of as plain strings.
  • Key benefits discussed: automatic column “explosion” from JSON, reduced CPU/IO versus repeatedly parsing strings, and support for mixed/dynamic schemas without predeclaring subtypes.
  • Multiple commenters note similar ideas in Snowflake (VARIANT), SingleStore, Scratchdata/DuckDB, Oracle 23ai, and possibly Google Capacitor.
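To make the "columnar, typed" point concrete, here is a minimal sketch of how the new type is used (table and column names are illustrative; earlier releases gate the type behind the `allow_experimental_json_type` setting):

```sql
CREATE TABLE events
(
    id   UInt64,
    data JSON   -- stored as typed subcolumns per path, not as one string
)
ENGINE = MergeTree
ORDER BY id;

INSERT INTO events FORMAT JSONEachRow
{"id": 1, "data": {"user": {"name": "alice"}, "clicks": 5}}
{"id": 2, "data": {"user": {"name": "bob"},   "tags": ["a", "b"]}}

-- Each JSON path becomes its own subcolumn, so this reads only the
-- referenced paths instead of re-parsing whole JSON strings per row:
SELECT data.user.name, data.clicks FROM events;
```

This is the "column explosion" the thread describes: paths are split out automatically at insert time, without predeclaring a schema.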

What “JSON” Means Here

  • Some are confused or skeptical about calling this “JSON,” arguing it’s really an internal storage format, not a new JSON standard.
  • Others clarify: it still ingests/outputs standard JSON; the “new type” is the binary/columnar representation of JSON values, analogous to PostgreSQL/SQLite’s JSONB.
  • Distinction is made between JSON as an encoding vs JSON as a data type (numbers, strings, arrays, objects).
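The encoding-vs-type distinction can be shown in one query (a sketch; the `::JSON` cast follows ClickHouse's documented syntax):

```sql
SELECT
    '{"a": 1}'                        AS as_string,  -- JSON as an encoding: just a String
    '{"a": 1}'::JSON                  AS as_json,    -- parsed once into the binary/columnar form
    toTypeName('{"a": 1}'::JSON)      AS type_name;  -- reports the JSON data type
```

Input and output remain standard JSON text; only the stored representation is new, which is the JSONB analogy commenters draw.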

ClickHouse vs Other Databases

  • ClickHouse is widely praised as “criminally underused,” especially as the next step once Postgres hits OLAP limits (ad‑hoc aggregates, count distinct, heavy write volumes, hundreds of GB and beyond).
  • Compared to DuckDB: DuckDB often faster per core and great for single-node analytics; ClickHouse seen as better once you need clusters, replication, and long-term growth.
  • Pinot and Doris are suggested alternatives, with Pinot touted for very low-latency “user-facing analytics” and star-tree indexes; there is disagreement over whether ClickHouse was originally single-node or cluster-first.
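The "OLAP limits" commenters mention are queries like the following, which strain a row store but suit a columnar engine (table and column names are hypothetical):

```sql
-- Ad-hoc aggregate with approximate count distinct: the columnar layout
-- scans only the two referenced columns across billions of rows.
SELECT
    event_type,
    count()        AS events,
    uniq(user_id)  AS approx_distinct_users  -- approximate; uniqExact() trades speed for exactness
FROM page_views
GROUP BY event_type
ORDER BY events DESC;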

Scalability & Operational Traits

  • Users report success at scales from hundreds of millions to 100B+ rows with minimal maintenance and good compression.
  • ClickHouse is described as resource-hungry and ill-suited to tightly shared environments (e.g., co-scheduled in Kubernetes pods rather than on dedicated big nodes).
  • Some concern about difficulty redistributing data when adding new nodes; remedies are unclear or rely on backup/restore tools.

Ingestion & “Real-Time”

  • Several people struggle with “reliable ingestion” and are put off by Kafka/ZooKeeper complexity.
  • Others report success via:
    • Direct HTTP/driver inserts (often batched).
    • Asynchronous inserts and ReplacingMergeTree to handle duplicates.
    • Buffers/ETL tools like Vector, Fluentd, RudderStack, Tinybird, Fivetran, S3 + refreshable materialized views, or Kafka-compatible systems like Redpanda.
  • Debate over what counts as “real-time”: strict per-event, exactly-once vs micro-batched, near-real-time analytics.
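The batched-insert-plus-dedup pattern reported in the thread can be sketched without any Kafka in the path (names are illustrative; the `SETTINGS` clause on `INSERT` follows ClickHouse's async-insert mechanism):

```sql
-- ReplacingMergeTree keeps the newest row per sorting key, so retried
-- inserts of the same event collapse to one row at merge time.
CREATE TABLE raw_events
(
    event_id String,
    payload  String,
    received DateTime
)
ENGINE = ReplacingMergeTree(received)
ORDER BY event_id;

-- Small, frequent HTTP/driver inserts buffered server-side:
INSERT INTO raw_events
SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES ('e1', '{"k": "v"}', now());

-- Merges are eventual; FINAL forces deduplication at read time:
SELECT * FROM raw_events FINAL;
```

This is near-real-time rather than strict exactly-once delivery, which is exactly the distinction the debate turns on.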

JSON Typing & Elasticsearch Contrast

  • New support for “dynamically changing data” types excites users frustrated with Elasticsearch’s “first-seen type wins” behavior.
  • One proposal (in Elasticsearch context) is type-suffixed field names (e.g., value::int, value::str), but others note this complicates queries and reporting.
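The contrast with Elasticsearch's first-seen-type mapping can be demonstrated directly (a sketch with hypothetical names; `dynamicType()` is the introspection function for the underlying Dynamic type):

```sql
CREATE TABLE docs (data JSON) ENGINE = MergeTree ORDER BY tuple();

-- The same path holds an integer in one row and a string in the next;
-- no type-suffixed field names are needed.
INSERT INTO docs FORMAT JSONEachRow
{"data": {"value": 42}}
{"data": {"value": "forty-two"}}

SELECT data.value, dynamicType(data.value) FROM docs;
```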