Yi-Coder: A Small but Mighty LLM for Code

Benchmarking and Comparisons

  • Several commenters criticize Yi-Coder’s comparisons for benchmarking against the older DeepSeek-Coder-v1 33B rather than the newer DeepSeek-Coder-V2 and V2-Lite, calling this choice of outdated baselines a marketing-style benchmarking trend.
  • When compared to DeepSeek-Coder-V2-Lite-Instruct (16B) on LiveCodeBench, Yi-Coder 9B is said to be slightly behind but “respectably close” given the size gap, while the full DeepSeek-Coder-V2 236B is described as “way ahead.”
  • Some users are curious how Yi-Coder would do on more demanding suites like SWE-bench.

Multi-language vs Single-language Code Models

  • One camp wants highly specialized, single-language models (e.g., Python-only) for deeper nuance and smaller size.
  • Others argue that cross-language training helps models generalize and often improves performance even on a single language.
  • Cited patterns: diverse data often beats repeated epochs on homogeneous data; high-quality data beats sheer volume.
  • There is interest in possibly translating multi-language corpora into one language, then training a smaller, focused model.

Privacy, Terms, and Geopolitics

  • DeepSeek’s cloud offering raises concerns: data stored on servers in China, broad licenses over user inputs/outputs, and corporate opacity.
  • Some see this as acceptable for open-source or public work, but not for proprietary/client code.
  • Questions arise about legality for EU users and potential for services to be blocked regionally.

Practical Usage and Local Setup

  • Common local stack: Ollama (or LocalAI, llamafile, LM Studio, text-generation-webui) + IDE plugin (e.g., Continue) for completions and chat.
  • Yi-Coder is available via Ollama in multiple quantizations; users note that only the 4-bit variant is visible at first, with the others tucked under “view more.”
  • Tools that expose OpenAI-compatible APIs can be pointed at local backends to integrate into existing workflows.
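As a rough sketch of that wiring (the URL below is Ollama’s default OpenAI-compatible endpoint; the model tag and the helper function are illustrative, not from the thread), a client can build a standard chat-completions request and POST it to the local backend:

```python
import json

# Ollama's default OpenAI-compatible endpoint (adjust host/port as needed).
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_request(model: str, prompt: str) -> tuple[str, bytes]:
    """Build (url, body) for an OpenAI-compatible /chat/completions call.

    Any IDE plugin or script that speaks the OpenAI API can send the same
    payload; only the base URL changes when pointing at a local backend.
    """
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return OLLAMA_URL, body

url, body = build_chat_request("yi-coder:9b", "Write a hello-world in Python.")
# With Ollama running, POST `body` to `url` with
# Content-Type: application/json (e.g. via urllib.request or requests).
```

Because the payload shape is the standard OpenAI one, tools like Continue only need the base URL and model name swapped to use the local model.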

Performance, Quality, and Benchmarks

  • Some users report Yi-Coder hallucinating, rambling, or mixing multiple languages in one answer; others find it “working great” once configured correctly.
  • Aider’s leaderboard shows Yi-Coder-9B-Chat at 54% vs GPT-3.5 at 58% and Claude 3.5 Sonnet at 77% on Python code-editing tasks; quantization (q4_0) drops Yi-Coder’s score further.
  • There is skepticism about over-relying on narrow benchmarks (e.g., 113 Python tasks) as proxies for broad coding capability.
  • Claude 3.5 Sonnet is widely regarded as the code-quality “gold standard,” with DeepSeek-Coder-V2 praised as the best price/performance; Yi-Coder seen as promising but not state-of-the-art.

Local Hardware, Quantization, and Context

  • Yi-Coder 8–9B is reported as runnable on consumer GPUs like an RTX 4090 (24 GB VRAM) and possibly 16 GB cards using quantization.
  • Users discuss trade-offs: FP16 vs quantized (Q4/Q8), VRAM use, and how quantization can hurt quality.
  • One user solved severe misbehavior by limiting Ollama to a single concurrent model and adjusting context and output-length settings, then achieved long-context use (e.g., 65K input tokens).
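The VRAM arithmetic behind those trade-offs is simple to sketch (weights only; the KV cache and runtime overhead add more on top, especially at long contexts like 65K tokens, and real quantized files carry some format overhead):

```python
def weight_vram_gib(params_billion: float, bits_per_weight: int) -> float:
    """Approximate VRAM needed for model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1024**3

# A 9B model: ~16.8 GiB at FP16 (tight even on a 24 GB RTX 4090 once the
# KV cache is added), ~8.4 GiB at Q8, ~4.2 GiB at Q4 (comfortable on 16 GB).
for name, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"9B @ {name}: ~{weight_vram_gib(9, bits):.1f} GiB")
```

This is why a 16 GB card is plausible only with quantization, and why quality loss at Q4 is the price of fitting on smaller GPUs.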

Programmers vs Artists on AI

  • Commenters note that LLM code assistants are seen as productivity tools rather than direct job replacements; code has a binary “works/doesn’t work” standard and still requires expertise to prompt and validate.
  • In contrast, AI image generators more directly replace low-end, “good enough” visual work, leading to stronger backlash from artists whose income often depends on such tasks.
  • Several participants argue that both codegen and imagegen are currently best at “low-level” or throwaway work; deep, intentional creative or complex engineering tasks remain hard for models.