Show HN: I modeled the Voynich Manuscript with SBERT to test for structure

Perceived structure vs randomness

  • Commenters broadly agree the modeled clusters, transition matrix, and section-specific patterns make the text look highly structured, not like naive random glyphs.
  • Some argue that humans are poor at writing truly at random: even a scribe deliberately trying to produce noise would likely retain this level of internal consistency, especially a practiced one.
  • Others note that non-cryptographic “visual” choices (making lines look nice, filling space, avoiding or forcing repeats) could still create patterns that appear linguistic.
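The transition-matrix evidence discussed above is easy to reproduce. A minimal sketch, assuming a space-separated transliteration (the `sample` words here are EVA-style stand-ins, not the real corpus):

```python
from collections import Counter, defaultdict

def transition_matrix(words):
    """Count word-to-word transitions, then normalize each row to probabilities."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
        for prev, nxts in counts.items()
    }

# Toy stand-in for an EVA transliteration; real input would be the full corpus.
sample = "daiin shedy qokeedy daiin shedy daiin qokeedy shedy".split()
probs = transition_matrix(sample)
print(probs["daiin"])  # e.g. {'shedy': 0.666..., 'qokeedy': 0.333...}
```

Strongly skewed rows (a word almost always followed by the same few successors) are the kind of pattern commenters read as structure; a uniform row would look like unconstrained randomness.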

Hoax/gibberish vs real or constructed language

  • One camp thinks the manuscript is fundamentally gibberish or a hoax/“naive art”: intentional imitation of writing without underlying language.
  • Counter-arguments: statistical analyses repeatedly find language-like structure; to achieve that, a hoaxer may effectively have invented a fairly elaborate system or conlang.
  • Skeptics respond that humans are poor random generators; fake language will naturally mirror properties of the author’s native language and can follow Zipf-like distributions, so “language-like” statistics don’t prove real language.
  • Some distinguish between: (1) a cipher or real language; (2) a constructed or stochastic fake language; (3) unconstrained gibberish—arguing (2) is the most plausible non-linguistic explanation.
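The skeptics' point that Zipf-like statistics come cheap can be demonstrated directly: Miller's classic "monkey at a typewriter" model, where characters are drawn uniformly and a space ends a word, already yields a roughly Zipfian rank-frequency curve. A small sketch (alphabet size and sample length are arbitrary choices):

```python
import random
from collections import Counter

random.seed(0)

# "Monkey typing": uniform random characters, space as word delimiter.
# Miller (1957) showed this alone produces a Zipf-like rank-frequency curve,
# so such a curve cannot by itself certify real language.
alphabet = "abcd "  # four letters plus space; deliberately tiny
text = "".join(random.choice(alphabet) for _ in range(200_000))
freqs = Counter(w for w in text.split() if w)

ranked = [count for _, count in freqs.most_common()]
# Zipf's law predicts frequency falling off roughly as 1/rank.
for rank in (1, 2, 4, 8, 16):
    print(rank, ranked[rank - 1])
```

The same caveat applies to the manuscript: matching a Zipf curve is a weak filter, which is why the hoax-vs-language debate turns on the rarer, finer-grained statistics instead.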

Linguistic and historical constraints

  • The manuscript materials and style are consistently dated to the early 15th century, ruling out some later-attribution hoax theories.
  • Palimpsest hypotheses are said to be contradicted by imaging studies.
  • Voynichese reportedly deviates from known languages: very few distinct signs, unusual character distributions, heavy repetition, odd lack/behavior of high-frequency words, and evidence for at least two “languages” (A/B) and multiple scribes.

Comments on the NLP approach

  • Several question using an older multilingual SBERT model trained on known languages: embeddings for an unknown script may be unreliable, and suffix-stripping might remove crucial information.
  • SBERT’s sentence-level design clashes with the manuscript’s lack of clear sentence boundaries.
  • Multiple people call for controls: run the same pipeline on real texts, ciphers, and synthetic Voynich-like gibberish (some provide generators) to see if similar clustering emerges.
  • There is debate over dimensionality reduction: some like PCA’s interpretability; others suggest UMAP, t‑SNE (with caveats), PaCMAP/LocalMAP, or even TDA and sparse autoencoders to probe deeper structure; also suggestions to build cluster–cluster similarity maps.
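The control some commenters ask for needs a synthetic Voynich-like corpus to feed through the same embedding-and-clustering pipeline. One hedged sketch of such a generator, using a hypothetical prefix/core/suffix slot grammar that loosely mimics Voynichese word shape (the slot inventories below are illustrative, not fitted to the manuscript):

```python
import random

random.seed(1)

# Hypothetical slot grammar: prefix + core + suffix, echoing the rigid
# internal word structure often reported for Voynichese. Illustrative only.
PREFIX = ["qo", "o", "ch", "sh", ""]
CORE   = ["ke", "te", "ai", "ol", "ed"]
SUFFIX = ["dy", "in", "y", "aiin", ""]

def fake_word():
    return random.choice(PREFIX) + random.choice(CORE) + random.choice(SUFFIX)

def fake_line(n_words=8):
    return " ".join(fake_word() for _ in range(n_words))

corpus = [fake_line() for _ in range(100)]
print(corpus[0])
```

If this kind of non-linguistic text clusters as cleanly as the real manuscript under the same pipeline, the clustering result alone cannot distinguish language from structured gibberish, which is exactly the commenters' point.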

Other hypotheses and directions

  • Various proposed decipherments (Germanic, Uralic, Old Turkish, recent “solutions”) are mentioned but generally described as unconvincing or non-generalizable.
  • Ideas raised include brute-force word mapping with scoring by language models, distributed “SETI@home”-style search, and comparison with biblical or other 15th‑century religious texts.
  • Some suggest analyzing page-to-page stylistic evolution and line-end glyph behavior as further signals of intentional structure vs decorative filler.
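The line-end glyph suggestion is also cheap to test: compare how often each glyph ends a line against its overall frequency. A minimal sketch over toy lines (a real run would use an EVA transliteration split by manuscript line; line-final pile-ups of certain glyphs, such as EVA 'm', are a commonly cited observation):

```python
from collections import Counter

def final_glyph_bias(lines):
    """Ratio of a glyph's line-final frequency to its overall frequency.

    A ratio well above 1 means the glyph is over-represented at line ends,
    suggesting line-aware writing rather than text wrapped arbitrarily.
    """
    overall = Counter(ch for line in lines for ch in line.replace(" ", ""))
    finals = Counter(line.rstrip()[-1] for line in lines if line.strip())
    total_o, total_f = sum(overall.values()), sum(finals.values())
    return {
        ch: (finals[ch] / total_f) / (overall[ch] / total_o)
        for ch in finals
    }

# Toy lines; real input would be the transliterated manuscript.
lines = ["daiin shedy qokam", "chedy qokeedy otam", "shedy daiin oram"]
bias = final_glyph_bias(lines)
print(bias)  # → {'m': 15.0}: 'm' appears at line ends far above chance
```

A flat bias profile would point toward decorative filler or arbitrary wrapping; a strongly skewed one is further evidence of intentional line-level structure.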