Show HN: LemonSlice – Upgrade your voice agents to real-time video

Use Cases and Enthusiasm

  • Many commenters find the real-time video agents unusually impressive, describing “mind blown” reactions and extended tinkering.
  • Popular imagined uses:
    • Turning coding/chat agents into more “employee-like” coworkers that record Loom-style walkthroughs.
    • Roleplay-based training (nurses triaging patients, SDRs practicing sales calls).
    • Language tutoring, customer support, and website onboarding.
  • Some users already built demos (e.g., a golden retriever tutor) and report a strong “computer has come alive” feeling.

Architecture, Integrations, and Controls

  • LemonSlice is positioned as a video “avatar layer” on top of arbitrary voice agents.
    • API takes text and streams back synchronized video.
    • LiveKit integration allows plugging in OpenAI realtime, other STT/LLM/TTS stacks, or future S2S providers.
    • Hosted option currently partners with ElevenLabs; default LLM in their own stack is Qwen.
  • Users can influence avatar motion and emotion via text prompts; finer-grained motion control via API is planned.
  • Background motion is also prompt-controlled; better hand-motion control is in training.
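The “avatar layer” positioning above can be sketched as a thin adapter: any voice agent produces text, and the layer hands that text to a streaming video renderer. The sketch below is a hypothetical illustration of that separation of concerns, not the actual LemonSlice API; the names (`VideoAvatarLayer`, `VideoFrame`, `speak`, the stub renderer) are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Iterator

# Hypothetical sketch of an "avatar layer" adapter; names and shapes
# are assumptions, not the real LemonSlice API.

@dataclass
class VideoFrame:
    """One synchronized audio/video chunk streamed back to the client."""
    timestamp_ms: int
    payload: bytes

class VideoAvatarLayer:
    """Wraps an arbitrary voice agent: text in, synchronized video frames out."""

    def __init__(self, render: Callable[[str], Iterator[VideoFrame]]):
        # `render` stands in for the provider's streaming video endpoint.
        self._render = render

    def speak(self, agent_reply: str) -> Iterator[VideoFrame]:
        # Any STT/LLM/TTS (or future S2S) stack can produce `agent_reply`;
        # the avatar layer only needs text, so providers are interchangeable.
        yield from self._render(agent_reply)

# Stub renderer for illustration: emits one fake frame per word.
def fake_renderer(text: str) -> Iterator[VideoFrame]:
    for i, word in enumerate(text.split()):
        yield VideoFrame(timestamp_ms=i * 50, payload=word.encode())

layer = VideoAvatarLayer(fake_renderer)
frames = list(layer.speak("hello from the avatar layer"))
```

The point of the design is the narrow interface: because the layer consumes plain text, swapping OpenAI realtime for another STT/LLM/TTS stack (as the LiveKit integration allows) requires no change on the video side.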

Quality, Latency, and UX Feedback

  • Praise for A/V sync and responsiveness overall, but several issues noted:
    • Low resolution/FPS, inconsistent lip-sync, and “cheap mic” audio feel for some avatars.
    • Latency is noticeable, especially compared with NVIDIA Personaplex; the team states that speed is their main focus.
    • The STT→LLM→TTS pipeline limits nuanced speech/pronunciation feedback (e.g., Spanish dialect practice); S2S is desired but proved too slow in the team's tests so far.
    • Occasional visual “hallucinations” (e.g., pseudo‑Chinese subtitles).
  • UI confusions:
    • The ~10s GPU spin-up reads as a processing delay.
    • The demo video defaults to 1.5× playback speed.
    • The privacy page was unreadable in dark mode (quickly fixed).
    • Some mobile iOS issues (details unclear).

Pricing, Product Model, and Openness

  • Pricing caused confusion: the difference between “Video Agents” (interactive calls) and “Creative Studio” (downloadable clips) needed explicit clarification.
  • Real-time calls are fully streamed; there’s no native “record and replay exact answer later” feature.
  • Core model is a 20B-parameter diffusion transformer running ~20fps on a single Hopper GPU. Team expects similar approaches to be widely copied; they see substantial “low-hanging fruit” in real-time DiT optimization.
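A rough back-of-envelope calculation suggests why the team sees “low-hanging fruit” in real-time DiT optimization. Only the 20B parameter count and ~20fps figure come from the thread; the token count, denoising-step count, FLOPs rule of thumb, and GPU peak below are illustrative assumptions, not vendor figures.

```python
# Back-of-envelope compute budget for a real-time diffusion transformer.
# Only `params` and `fps` come from the thread; everything else is assumed.

params = 20e9            # 20B-parameter DiT (stated in the thread)
fps = 20                 # stated real-time frame rate
tokens_per_frame = 256   # assumed latent tokens per video frame
denoise_steps = 4        # assumed few-step distilled sampler

# ~2 FLOPs per parameter per token per forward pass (common rule of thumb)
flops_per_frame = 2 * params * tokens_per_frame * denoise_steps
flops_per_second = flops_per_frame * fps

hopper_fp8_peak = 2e15   # rough H100 FP8 peak, ~2 PFLOPS (dense)
utilization = flops_per_second / hopper_fp8_peak

print(f"{flops_per_second / 1e12:.0f} TFLOPs/s needed, "
      f"~{utilization:.0%} of one Hopper GPU's FP8 peak")
```

Under these assumptions a single Hopper GPU has headroom, but real sustained utilization is far below peak, which is consistent with the claim that sampler distillation, quantization, and kernel-level work still offer large speedups.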
  • Open-weights release is under consideration; concerns are support overhead, not just customer cannibalization. No concrete commitment yet.
  • IP protection, profitability, and business metrics are asked about but not substantively answered in-thread (status unclear).

Ethical and Societal Concerns

  • Multiple commenters express strong discomfort and “Absolutely Do Not Want” reactions, especially around:
    • AI-only interviews, HR interactions, and training replacing human contact.
    • Call-center automation and further degrading human-facing services.
    • Photorealistic avatars worsening deepfake/identity-trust problems; preference for clearly non-human robots.
  • Others argue that harms are manageable and comparable to past disruptive tech (cars, nuclear power), urging focus on benefits and customer value rather than halting development.
  • Brief suspicion of astroturfing, prompted by the overwhelming positivity, is raised; a moderator reminds participants that such accusations are against HN guidelines.