Show HN: I modeled the Voynich Manuscript with SBERT to test for structure
Perceived structure vs randomness
- Commenters broadly agree the modeled clusters, transition matrix, and section-specific patterns make the text look highly structured, unlike naively generated random glyphs.
- Some argue this level of internal consistency would be hard to destroy even if someone tried to write “randomly,” especially for a practiced scribe.
- Others note that non-cryptographic “visual” choices (making lines look nice, filling space, avoiding or forcing repeats) could still create patterns that appear linguistic.
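The transition-matrix idea above is easy to make concrete on any tokenized transcription. This is a minimal sketch, not the post's actual pipeline; the tokens below are toy stand-ins loosely shaped like EVA transcription words.

```python
from collections import Counter, defaultdict

def transition_counts(tokens):
    """Count how often each token is immediately followed by each other token."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(tokens, tokens[1:]):
        counts[cur][nxt] += 1
    return counts

def transition_probs(counts):
    """Normalize follower counts into conditional probabilities P(next | current)."""
    probs = {}
    for cur, followers in counts.items():
        total = sum(followers.values())
        probs[cur] = {nxt: n / total for nxt, n in followers.items()}
    return probs

# Toy stand-in for a transcription; a real run would use EVA tokens.
tokens = "daiin chedy qokeedy daiin shedy qokeedy daiin chedy".split()
probs = transition_probs(transition_counts(tokens))
print(probs["daiin"])  # conditional distribution over words following 'daiin'
```

A strongly peaked row in such a matrix (a word with very few likely followers) is one of the signals commenters read as "structured rather than random".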
Hoax/gibberish vs real or constructed language
- One camp thinks the manuscript is fundamentally gibberish or a hoax/“naive art”: intentional imitation of writing without underlying language.
- Counter-arguments: statistical analyses repeatedly find language-like structure; to achieve that, a hoaxer may effectively have invented a fairly elaborate system or conlang.
- Skeptics respond that humans are poor random generators; fake language will naturally mirror properties of the author’s native language and can follow Zipf-like distributions, so “language-like” statistics don’t prove real language.
- Some distinguish between: (1) a cipher or real language; (2) a constructed or stochastic fake language; (3) unconstrained gibberish—arguing (2) is the most plausible non-linguistic explanation.
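The skeptics' point that "language-like" statistics are cheap to produce can be checked with the classic "monkey typing" setup (random letters with random spaces), which is known to yield a Zipf-like rank-frequency curve. The alphabet size and space probability below are illustrative assumptions, not values from the discussion.

```python
import math
import random
from collections import Counter

def monkey_text(n_chars, alphabet="abcdefgh", p_space=0.18, seed=0):
    """Generate 'monkey typing' text: uniform random letters with random spaces."""
    rng = random.Random(seed)
    chars = [" " if rng.random() < p_space else rng.choice(alphabet)
             for _ in range(n_chars)]
    return "".join(chars)

def loglog_slope(freqs):
    """Least-squares slope of log(frequency) vs log(rank)."""
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

words = monkey_text(200_000).split()
freqs = [c for _, c in Counter(words).most_common(200)]
print(f"rank-frequency slope: {loglog_slope(freqs):.2f}")  # negative, Zipf-like
```

A clearly negative log-log slope emerges even though no language is involved, which is exactly why Zipf-like statistics alone don't settle the gibberish question.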
Linguistic and historical constraints
- The manuscript materials and style are consistently dated to the early 15th century, ruling out some later-attribution hoax theories.
- Palimpsest hypotheses are said to be contradicted by imaging studies.
- Voynichese reportedly deviates from known languages: very few distinct signs, unusual character distributions, heavy repetition, odd lack/behavior of high-frequency words, and evidence for at least two “languages” (A/B) and multiple scribes.
Comments on the NLP approach
- Several question using an older multilingual SBERT model trained on known languages: embeddings for an unknown script may be unreliable, and suffix-stripping might remove crucial information.
- SBERT’s sentence-level design sits awkwardly with the manuscript’s lack of clear sentence boundaries.
- Multiple people call for controls: run the same pipeline on real texts, ciphers, and synthetic Voynich-like gibberish (some provide generators) to see if similar clustering emerges.
- There is debate over dimensionality reduction: some like PCA’s interpretability; others suggest UMAP, t‑SNE (with caveats), PaCMAP/LocalMAP, or even TDA and sparse autoencoders to probe deeper structure; also suggestions to build cluster–cluster similarity maps.
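The call for controls can be made concrete without any embedding model: compute a sequential statistic on the original token stream, then on a shuffled copy, and see whether the gap survives. This sketch uses bigram conditional entropy on toy data as a stand-in baseline, not the post's SBERT pipeline.

```python
import math
import random
from collections import Counter, defaultdict

def conditional_entropy(tokens):
    """H(next | current) over word bigrams, in bits.
    Sequentially structured text scores lower than a shuffled baseline."""
    pair_counts = defaultdict(Counter)
    for cur, nxt in zip(tokens, tokens[1:]):
        pair_counts[cur][nxt] += 1
    total = len(tokens) - 1
    h = 0.0
    for followers in pair_counts.values():
        n_cur = sum(followers.values())
        for n in followers.values():
            h -= (n / total) * math.log2(n / n_cur)
    return h

# Toy 'structured' text with strong word-order regularities vs. its shuffle.
structured = ("the cat sat on the mat and the dog sat on the rug " * 200).split()
shuffled = structured[:]
random.Random(0).shuffle(shuffled)

print(conditional_entropy(structured), conditional_entropy(shuffled))
```

The same shuffle-control logic applies to the SBERT pipeline: if shuffled or synthetic gibberish clusters just as cleanly as the real transcription, the clusters say little about underlying language.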
Other hypotheses and directions
- Various proposed decipherments (Germanic, Uralic, Old Turkish, recent “solutions”) are mentioned but generally described as unconvincing or non-generalizable.
- Ideas raised include brute-force word mapping with scoring by language models, distributed “SETI@home”-style search, and comparison with biblical or other 15th‑century religious texts.
- Some suggest analyzing page-to-page stylistic evolution and line-end glyph behavior as further signals of intentional structure vs decorative filler.
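The line-end suggestion is also simple to operationalize: compare the distribution of final glyphs for line-final words against line-internal words. The lines below are invented toy data (a real run would use an EVA transcription with line breaks preserved); a large divergence between the two profiles would hint that line ends are treated specially, whether linguistically or as decorative filler.

```python
from collections import Counter

def final_glyph_profile(lines):
    """Return normalized last-glyph frequencies for line-final vs internal words."""
    line_final, internal = Counter(), Counter()
    for line in lines:
        words = line.split()
        if not words:
            continue
        line_final[words[-1][-1]] += 1
        for w in words[:-1]:
            internal[w[-1]] += 1

    def norm(counter):
        total = sum(counter.values())
        return {g: n / total for g, n in counter.items()}

    return norm(line_final), norm(internal)

# Toy lines shaped like EVA words; not real manuscript data.
lines = ["daiin chedy qokeedy am", "shedy qokaiin dam", "chedy daiin okam"]
final, internal = final_glyph_profile(lines)
print(final)     # last-glyph distribution at line ends
print(internal)  # last-glyph distribution elsewhere
```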