Show HN: LemonSlice – Upgrade your voice agents to real-time video
Use Cases and Enthusiasm
- Many commenters find the real-time video agents unusually impressive, reporting “mind blown” reactions and long stretches of tinkering.
- Popular imagined uses:
  - Turning coding/chat agents into more “employee-like” coworkers that record Loom-style walkthroughs.
  - Roleplay-based training (nurses triaging patients, SDRs practicing sales calls).
  - Language tutoring, customer support, and website onboarding.
- Some users already built demos (e.g., a golden retriever tutor) and report a strong “computer has come alive” feeling.
Architecture, Integrations, and Controls
- LemonSlice is positioned as a video “avatar layer” on top of arbitrary voice agents.
- API takes text and streams back synchronized video (a hypothetical client sketch follows this list).
- LiveKit integration allows plugging in OpenAI realtime, other STT/LLM/TTS stacks, or future speech-to-speech (S2S) providers (a wiring sketch also follows below).
- The hosted option currently partners with ElevenLabs; the default LLM in LemonSlice’s own stack is Qwen.
- Users can influence avatar motion and emotion via text prompts; finer-grained motion control via API is planned.
- Background motion is also prompt-controlled; better hand-motion control is in training.
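
A minimal sketch of the text-in, video-out loop described above. Everything specific here is an assumption for illustration: the endpoint URL, message shapes, and field names are invented and are not LemonSlice’s documented API.

```python
# Hypothetical client for a "send text, stream back synchronized video" API.
# The endpoint and message schema below are invented for illustration; they
# are not LemonSlice's documented interface.
import asyncio
import json

import websockets  # pip install websockets


def handle_video_chunk(chunk: bytes) -> None:
    # In a real client this would feed a decoder or a <video> media source.
    print(f"received {len(chunk)} bytes of A/V data")


async def stream_reply(text: str) -> None:
    async with websockets.connect("wss://example.invalid/v1/agent/stream") as ws:
        # Send the text the avatar should speak...
        await ws.send(json.dumps({"type": "say", "text": text}))
        # ...and consume the synchronized video stream as it arrives.
        async for msg in ws:
            if isinstance(msg, bytes):
                handle_video_chunk(msg)
            elif json.loads(msg).get("type") == "end":
                break


asyncio.run(stream_reply("Hi! Walk me through the onboarding flow."))
```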
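And a sketch of the LiveKit side, following the standard livekit-agents (Python) quickstart shape. The commented-out `lemonslice` plugin and `AvatarSession` usage are assumptions modeled on LiveKit’s existing avatar-plugin convention, not a confirmed LemonSlice integration; the STT/LLM/TTS choices are placeholders, since the point of the integration is that any stack can sit underneath the avatar layer.

```python
# Wiring an interchangeable STT/LLM/TTS stack under an avatar layer with
# livekit-agents. Only the lemonslice lines are hypothetical; the rest is
# the standard LiveKit Agents quickstart pattern.
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import openai, silero
# from livekit.plugins import lemonslice  # hypothetical avatar plugin


async def entrypoint(ctx: agents.JobContext) -> None:
    await ctx.connect()

    # Any STT/LLM/TTS combination can sit underneath the avatar layer.
    session = AgentSession(
        stt=openai.STT(),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=openai.TTS(),
        vad=silero.VAD.load(),
    )

    # Hypothetical: the avatar session would consume the agent's audio and
    # publish a synchronized video track into the room.
    # avatar = lemonslice.AvatarSession(avatar_id="golden-retriever-tutor")
    # await avatar.start(session, room=ctx.room)

    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a friendly language tutor."),
    )


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```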
Quality, Latency, and UX Feedback
- Praise for A/V sync and responsiveness overall, but several issues noted:
  - Low resolution/FPS, inconsistent lip-sync, and a “cheap mic” audio feel for some avatars.
  - Latency is noticeable, especially compared with NVIDIA Personaplex; speed is a stated main focus area.
  - The STT→LLM→TTS pipeline limits nuanced speech/pronunciation feedback (e.g., Spanish dialect practice); S2S is desired but has so far been too slow in the team’s tests.
  - Occasional visual “hallucinations” (e.g., pseudo‑Chinese subtitles).
- UI confusion: the ~10s GPU spin-up reads as a processing delay; the demo video defaults to 1.5x playback speed; the privacy page was unreadable in dark mode (quickly fixed); some iOS mobile issues were reported (details unclear).
Pricing, Product Model, and Openness
- Pricing caused confusion: the difference between “Video Agents” (interactive calls) and “Creative Studio” (downloadable clips) needed explicit clarification in-thread.
- Real-time calls are fully streamed; there’s no native “record and replay exact answer later” feature.
- Core model is a 20B-parameter diffusion transformer running at ~20fps on a single Hopper GPU (a quick per-frame budget sketch follows this list). Team expects similar approaches to be widely copied; they see substantial “low-hanging fruit” in real-time DiT optimization.
- Open-weights release is under consideration; concerns are support overhead, not just customer cannibalization. No concrete commitment yet.
- IP protection, profitability, and business metrics are asked about but not substantively answered in-thread (status unclear).
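
For scale, a back-of-envelope reading of the ~20fps figure: each frame’s full denoise-and-decode pass has to fit inside the budget computed below, which gives a sense of why the team sees optimization headroom. Only the fps number comes from the thread; the arithmetic is added here.

```python
# Per-frame latency budget implied by a ~20 fps real-time diffusion model.
fps = 20
frame_budget_ms = 1000 / fps  # 50 ms for the full denoise + decode pass
print(f"Per-frame budget at {fps} fps: {frame_budget_ms:.0f} ms")
```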
Ethical and Societal Concerns
- Multiple commenters express strong discomfort and “Absolutely Do Not Want” reactions, especially around:
  - AI-only interviews, HR interactions, and training replacing human contact.
  - Call-center automation and further degrading human-facing services.
  - Photorealistic avatars worsening deepfake/identity-trust problems; preference for clearly non-human robots.
- Others argue that harms are manageable and comparable to past disruptive tech (cars, nuclear power), urging focus on benefits and customer value rather than halting development.
- One commenter briefly suspects astroturfing given the overwhelming positivity; a moderator notes that such accusations are against HN guidelines.