g1: Using Llama-3.1 70B on Groq to create o1-like reasoning chains
Prompting, Compliance, and “Don’t Hallucinate”
- Commenters note that simple prompt tweaks (all caps, “don’t hallucinate”, “admit when you don’t know”) often improve outputs, but disagree on why.
- Some argue the model learns to treat such phrasing as “be more conservative / factual / context-bound.”
- Others are skeptical: without external ground truth, a model can’t truly detect hallucinations; prompts may mostly shift style or liability posture.
- Experience varies by model: some local models follow “say you don’t know” well; others ignore it and confabulate.
The g1 Prompt and the Limits of Prompt Engineering
- g1 is essentially an elaborate system prompt that enforces step‑by‑step reasoning, JSON‑structured output, and explicit self‑doubt and re‑examination.
- Several commenters say this is “just CoT in a loop,” not o1‑like reasoning; the Python orchestration is seen as mostly boilerplate (a minimal sketch of the loop follows this list).
- Alternative prompts that encourage hidden scratchpad thinking and exhaustive chains of thought sometimes work better on specific puzzles but still fail on simple tests like “three sentences that end in ‘is’.”
- Consensus: prompt engineering alone cannot reach o1 performance; it helps but hits clear ceilings.
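A minimal sketch of that “CoT in a loop” pattern, assuming the Groq Python SDK (any OpenAI‑compatible client works the same way). The system prompt, the JSON keys ("title", "content", "next_action"), and the model name are illustrative stand‑ins, not g1’s actual prompt or schema.

```python
import json
from groq import Groq  # assumes the Groq Python SDK; reads GROQ_API_KEY from the environment

client = Groq()

# Illustrative system prompt in the spirit of g1: force one JSON-structured reasoning step per turn.
SYSTEM = (
    "You are an expert reasoner. Respond with a single JSON object per turn, "
    'with keys "title", "content", and "next_action" ("continue" or "final_answer"). '
    "Re-examine your own reasoning and consider that you may be wrong."
)

def reasoning_chain(question: str, max_steps: int = 10) -> list[dict]:
    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": question},
    ]
    steps = []
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model="llama-3.1-70b-versatile",  # model name is illustrative
            messages=messages,
            temperature=0.2,
        )
        raw = resp.choices[0].message.content
        try:
            step = json.loads(raw)
        except json.JSONDecodeError:
            # Models regularly break the JSON contract; keep the raw text and continue.
            step = {"title": "Unparsed step", "content": raw, "next_action": "continue"}
        steps.append(step)
        # Feed the model its own step back so the next turn can extend (or doubt) it.
        messages.append({"role": "assistant", "content": raw})
        messages.append({"role": "user", "content": "Continue, or give the final answer."})
        if step.get("next_action") == "final_answer":
            break
    return steps
```

The orchestration is intentionally thin here, which matches the commenters’ point that the prompt, not the Python, carries the substance of the project.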
o1 vs Chain‑of‑Thought, Trees, and RL
- Strong debate on what makes o1 different:
  - One side: it’s CoT plus heavy reinforcement learning and curated reasoning traces, not a simple prompt.
  - Others speculate about internal tree search or MCTS‑style mechanisms, citing hiring histories and test‑time compute scaling.
  - Counterpoint: public statements say o1 is “just a model” at inference, not a multi‑system pipeline; details remain unclear.
- Multiple people stress that aligning long reasoning chains and collecting high‑quality CoT data is nontrivial and likely the main innovation.
Model Self‑Knowledge and Limitations
- Some doubt LLMs can reliably know when they’re wrong, since they lack access to their training corpus and to any calibrated measure of their own uncertainty.
- Others suggest models can still be trained to behave as if aware of limitations (e.g., knowledge cutoffs, math unreliability) using textual descriptions and feedback.
Evals, Benchmarks, and Related Projects
- Several forks adapt the idea to local models (Ollama, small Llamas, Phi‑3, etc.); anecdotal results are mixed, often failing classic trick questions (e.g., the “strawberry” letter‑count question, obfuscated_fibonacci); a local‑model variant of the loop is sketched after this list.
- Calls for robust benchmarks (MMLU‑Pro, LiveBench, etc.), with some lamenting that projects become less “fun” once serious evals expose limits.
- Reflection‑style fine‑tunes are mentioned, with strong skepticism that prior public “reasoning” models matched their claims.
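A rough idea of what the local‑model forks do, assuming the ollama Python package and a running Ollama server; the model name and the JSON step format are illustrative, carried over from the sketch above.

```python
import json
import ollama  # assumes the ollama Python package and a local Ollama server


def local_reasoning_step(messages: list[dict], model: str = "phi3") -> dict:
    """Run one JSON-structured reasoning step against a local model (model name illustrative)."""
    resp = ollama.chat(model=model, messages=messages)
    raw = resp["message"]["content"]
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Small local models break the JSON contract far more often than 70B-class models,
        # which is one reason the anecdotal results in these forks are mixed.
        return {"title": "Unparsed step", "content": raw, "next_action": "continue"}
```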
Broader Reflections
- Some argue “reasoning” tasks like counting, planning, and formal inference might be better handled by hybrid systems combining LLMs with classical algorithms or search (a toy counting example follows this list).
- There is brief discussion of energy use and dataset “junk”; opinions differ on how much removing low‑value facts or languages would actually help intelligence vs. harm it.
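A toy illustration of that hybrid idea, with hypothetical names: the LLM would only extract the subtask (e.g., via a tool or function call), while deterministic code does the counting that trips up pure prompting.

```python
def count_letter(word: str, letter: str) -> int:
    """Deterministic counting: the kind of subtask commenters suggest offloading from the LLM."""
    return word.lower().count(letter.lower())


# The model never does the arithmetic; it only has to route the question to the tool.
assert count_letter("strawberry", "r") == 3
```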