Show HN: I modeled the Voynich Manuscript with SBERT to test for structure
Perceived structure vs randomness
- Commenters broadly agree the modeled clusters, transition matrix, and section-specific patterns make the text look highly structured, unlike naively generated random glyphs.
- Some argue this level of internal consistency would be hard to destroy even if someone tried to write “randomly,” especially for a practiced scribe.
- Others note that non-cryptographic “visual” choices (making lines look nice, filling space, avoiding or forcing repeats) could still create patterns that appear linguistic.
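The transition-matrix idea above is easy to make concrete on any tokenized transcription. This is a minimal sketch, not the post's actual pipeline; the tokens below are toy stand-ins loosely shaped like EVA transcription words.

```python
from collections import Counter, defaultdict

def transition_counts(tokens):
    """Count how often each token is immediately followed by each other token."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(tokens, tokens[1:]):
        counts[cur][nxt] += 1
    return counts

def transition_probs(counts):
    """Normalize follower counts into conditional probabilities P(next | current)."""
    probs = {}
    for cur, followers in counts.items():
        total = sum(followers.values())
        probs[cur] = {nxt: n / total for nxt, n in followers.items()}
    return probs

# Toy stand-in for a transcription; a real run would use EVA tokens.
tokens = "daiin chedy qokeedy daiin shedy qokeedy daiin chedy".split()
probs = transition_probs(transition_counts(tokens))
print(probs["daiin"])  # conditional distribution over words following 'daiin'
```

A strongly peaked row in such a matrix (a word with very few likely followers) is one of the signals commenters read as "structured rather than random".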
Hoax/gibberish vs real or constructed language
- One camp thinks the manuscript is fundamentally gibberish or a hoax/“naive art”: intentional imitation of writing without underlying language.
- Counter-arguments: statistical analyses repeatedly find language-like structure; to achieve that, a hoaxer may effectively have invented a fairly elaborate system or conlang.
- Skeptics respond that humans are poor random generators; fake language will naturally mirror properties of the author’s native language and can follow Zipf-like distributions, so “language-like” statistics don’t prove real language.
- Some distinguish between: (1) a cipher or real language; (2) a constructed or stochastic fake language; (3) unconstrained gibberish—arguing (2) is the most plausible non-linguistic explanation.
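The skeptics' point that "language-like" statistics are cheap to produce can be checked with the classic "monkey typing" setup (random letters with random spaces), which is known to yield a Zipf-like rank-frequency curve. The alphabet size and space probability below are illustrative assumptions, not values from the discussion.

```python
import math
import random
from collections import Counter

def monkey_text(n_chars, alphabet="abcdefgh", p_space=0.18, seed=0):
    """Generate 'monkey typing' text: uniform random letters with random spaces."""
    rng = random.Random(seed)
    chars = [" " if rng.random() < p_space else rng.choice(alphabet)
             for _ in range(n_chars)]
    return "".join(chars)

def loglog_slope(freqs):
    """Least-squares slope of log(frequency) vs log(rank)."""
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

words = monkey_text(200_000).split()
freqs = [c for _, c in Counter(words).most_common(200)]
print(f"rank-frequency slope: {loglog_slope(freqs):.2f}")  # negative, Zipf-like
```

A clearly negative log-log slope emerges even though no language is involved, which is exactly why Zipf-like statistics alone don't settle the gibberish question.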
Linguistic and historical constraints
- The manuscript materials and style are consistently dated to the early 15th century, ruling out some later-attribution hoax theories.
- Palimpsest hypotheses are said to be contradicted by imaging studies.
- Voynichese reportedly deviates from known languages: very few distinct signs, unusual character distributions, heavy repetition, odd lack/behavior of high-frequency words, and evidence for at least two “languages” (A/B) and multiple scribes.
Comments on the NLP approach
- Several question using an older multilingual SBERT model trained on known languages: embeddings for an unknown script may be unreliable, and suffix-stripping might remove crucial information.
- SBERT’s sentence-level design sits awkwardly with the manuscript’s lack of clear sentence boundaries.
- Multiple people call for controls: run the same pipeline on real texts, ciphers, and synthetic Voynich-like gibberish (some provide generators) to see if similar clustering emerges.
- There is debate over dimensionality reduction: some like PCA’s interpretability; others suggest UMAP, t‑SNE (with caveats), PaCMAP/LocalMAP, or even TDA and sparse autoencoders to probe deeper structure; also suggestions to build cluster–cluster similarity maps.
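The call for controls can be made concrete without any embedding model: compute a sequential statistic on the original token stream, then on a shuffled copy, and see whether the gap survives. This sketch uses bigram conditional entropy on toy data as a stand-in baseline, not the post's SBERT pipeline.

```python
import math
import random
from collections import Counter, defaultdict

def conditional_entropy(tokens):
    """H(next | current) over word bigrams, in bits.
    Sequentially structured text scores lower than a shuffled baseline."""
    pair_counts = defaultdict(Counter)
    for cur, nxt in zip(tokens, tokens[1:]):
        pair_counts[cur][nxt] += 1
    total = len(tokens) - 1
    h = 0.0
    for followers in pair_counts.values():
        n_cur = sum(followers.values())
        for n in followers.values():
            h -= (n / total) * math.log2(n / n_cur)
    return h

# Toy 'structured' text with strong word-order regularities vs. its shuffle.
structured = ("the cat sat on the mat and the dog sat on the rug " * 200).split()
shuffled = structured[:]
random.Random(0).shuffle(shuffled)

print(conditional_entropy(structured), conditional_entropy(shuffled))
```

The same shuffle-control logic applies to the SBERT pipeline: if shuffled or synthetic gibberish clusters just as cleanly as the real transcription, the clusters say little about underlying language.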
Other hypotheses and directions
- Various proposed decipherments (Germanic, Uralic, Old Turkish, recent “solutions”) are mentioned but generally described as unconvincing or non-generalizable.
- Ideas raised include brute-force word mapping with scoring by language models, distributed “SETI@home”-style search, and comparison with biblical or other 15th‑century religious texts.
- Some suggest analyzing page-to-page stylistic evolution and line-end glyph behavior as further signals of intentional structure vs decorative filler.
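The line-end suggestion is also simple to operationalize: compare the distribution of final glyphs for line-final words against line-internal words. The lines below are invented toy data (a real run would use an EVA transcription with line breaks preserved); a large divergence between the two profiles would hint that line ends are treated specially, whether linguistically or as decorative filler.

```python
from collections import Counter

def final_glyph_profile(lines):
    """Return normalized last-glyph frequencies for line-final vs internal words."""
    line_final, internal = Counter(), Counter()
    for line in lines:
        words = line.split()
        if not words:
            continue
        line_final[words[-1][-1]] += 1
        for w in words[:-1]:
            internal[w[-1]] += 1

    def norm(counter):
        total = sum(counter.values())
        return {g: n / total for g, n in counter.items()}

    return norm(line_final), norm(internal)

# Toy lines shaped like EVA words; not real manuscript data.
lines = ["daiin chedy qokeedy am", "shedy qokaiin dam", "chedy daiin okam"]
final, internal = final_glyph_profile(lines)
print(final)     # last-glyph distribution at line ends
print(internal)  # last-glyph distribution elsewhere
```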