LLMs are getting better at character-level text manipulation
Prompting, Guardrails, and Safety Orientation
- Early Claude system prompts explicitly instructed the model to “think step by step” and count characters one by one (a hypothetical reconstruction follows this list); that guidance disappears in later models, suggesting either improved post-training or a desire to reclaim context for other rules.
- Some see extremely long safety/system prompts as “guard rails” that trade creativity and performance for brand safety, while others argue this is precisely the responsible way to uncover and mitigate dangerous behaviors before real-world deployment.
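A minimal sketch of the kind of counting instruction described above. The wording and the `build_system_prompt` helper are hypothetical illustrations, not the actual Claude system prompt:

```python
# Hypothetical reconstruction of a "count step by step" instruction; the text
# and the helper below are illustrative only, not the real Claude system prompt.
COUNTING_HINT = (
    "When asked to count letters or characters, think step by step: "
    "first spell the word out one character per line, then tally the "
    "matching characters, and only then state the total."
)

def build_system_prompt(base_rules: str) -> str:
    """Append the counting guidance to whatever other rules the prompt carries."""
    return base_rules.rstrip() + "\n\n" + COUNTING_HINT

print(build_system_prompt("You are a helpful assistant."))
```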
Counting, Tools, and “Cheating”
- Many commenters argue that LLMs could reliably handle character-level tasks via tools (e.g., Python), and in practice they already do so when explicitly asked; see the sketch after this list for what such a tool call reduces to.
- Frustration: users must micromanage models (“use your Python tool”, include certain files, etc.), which undermines the promise of intuitive, general intelligence.
- There’s tension between demanding “pure” in-model ability and accepting tool use as legitimate intelligence, analogous to humans using calculators.
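The deterministic helpers a Python tool call boils down to are trivial; the function names below are illustrative, not any vendor’s tool API:

```python
# What "use your Python tool" reduces to: plain, deterministic string operations.
# Function names are illustrative, not part of any vendor's tool API.
def count_char(word: str, char: str) -> int:
    """Count case-insensitive occurrences of a single character."""
    return word.lower().count(char.lower())

def reverse(word: str) -> str:
    """Reverse a string character by character."""
    return word[::-1]

print(count_char("strawberry", "r"))  # 3
print(reverse("strawberry"))          # yrrebwarts
```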
Tokenization and Architectural Limits
- Modern LLMs tokenize at the subword/morpheme level, so character-level detail sits below their native resolution; models must effectively “reverse engineer” the tokenization to count letters (the toy tokenizer after this list illustrates the gap).
- Tokenizing by character would help these tasks but greatly reduces effective context and efficiency under current architectures, though newer architectures (Mamba, RWKV, byte-level experiments) may mitigate this somewhat.
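A toy greedy longest-match tokenizer makes the resolution problem concrete; the vocabulary below is invented for illustration and is not any real model’s vocabulary:

```python
# Toy subword tokenizer: greedy longest match against a made-up vocabulary.
# Real BPE vocabularies are learned, but the effect on character visibility is similar.
VOCAB = {"straw", "berry", "blue", "st", "raw", "ber", "ry"}

def tokenize(word: str) -> list[str]:
    """Greedily take the longest vocabulary entry at each position."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown characters fall back to themselves
            i += 1
    return tokens

print(tokenize("strawberry"))  # ['straw', 'berry'] -- no token exposes the three r's
print(tokenize("blueberry"))   # ['blue', 'berry']
```

A character- or byte-level tokenizer would expose every letter directly, but for English text it inflates sequence length by roughly a factor of three to five, which is the context/efficiency trade-off described above.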
Training, Overfitting, and Emergent Skills
- Some see the improvements (e.g., correctly counting the r’s in “strawberry”) as overfitting to viral test questions rather than true reasoning; others counter that less-viral variants like the b’s in “blueberry” are handled just as well, suggesting a broader skill rather than memorization.
- Base64 decoding is discussed as a likely emergent skill picked up from web data rather than something explicitly optimized for, whereas custom base-N encodings expose limits and inconsistencies.
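A short contrast between the two cases; the base-7 alphabet below is arbitrary, chosen only to stand in for the kind of custom encoding used as a probe:

```python
# Base64 is ubiquitous in web data and fully deterministic for code; a custom
# base-N encoding with an arbitrary alphabet is the kind of input an LLM has
# rarely, if ever, seen. The 7-symbol alphabet here is made up.
import base64

print(base64.b64decode("aGVsbG8gd29ybGQ=").decode())  # hello world

ALPHABET = "XQZJKVW"  # arbitrary base-7 symbol set, for illustration only

def to_base_n(num: int, alphabet: str = ALPHABET) -> str:
    """Encode a non-negative integer using a custom symbol set."""
    base = len(alphabet)
    if num == 0:
        return alphabet[0]
    digits = []
    while num:
        num, rem = divmod(num, base)
        digits.append(alphabet[rem])
    return "".join(reversed(digits))

print(to_base_n(2024))  # trivial for code, unfamiliar territory for a model
```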
Real-World Use Cases and Remaining Weaknesses
- Character-level skills matter in word games (Quartiles, Wordle-like puzzles), language-learning tasks that dissect morphology, and possibly toxicity detection where users obfuscate insults.
- Despite progress, models still fail on structured symbol tasks like Roman numerals and can hallucinate in constrained word puzzles or spelling-by-phone scenarios.
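For contrast, the structured-symbol task that still trips models up is a few lines of deterministic code; this parser is a generic sketch, not tied to any specific comment in the thread:

```python
# Deterministic Roman-numeral parser, honoring subtractive pairs like IV and XC.
VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(numeral: str) -> int:
    """Sum symbol values, subtracting when a smaller symbol precedes a larger one."""
    total = 0
    for symbol, nxt in zip(numeral, numeral[1:] + " "):
        value = VALUES[symbol]
        total += -value if VALUES.get(nxt, 0) > value else value
    return total

print(roman_to_int("MCMXCIV"))  # 1994
```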
Debate Over Testing Relevance
- One side: these tests are a “hammer vs. screw” misuse of LLMs, reaching for the wrong tool; character-level work should just use deterministic algorithms.
- Other side: it’s informative and important that systems touted as near-human intelligence still break on seemingly simple symbolic tasks.