Nvidia PersonaPlex 7B on Apple Silicon: Full-Duplex Speech-to-Speech in Swift
Overall impressions
- Many find PersonaPlex on Apple Silicon technically impressive and novel, especially the low-latency full‑duplex speech‑to‑speech aspect.
- Others are underwhelmed by usefulness: a 7B “mouthpiece” without strong reasoning or tools is seen as more of a demo than a practical assistant.
Full‑duplex vs pipeline architectures
- Full‑duplex (end‑to‑end speech model) feels more natural, preserves tone/timing, and can backchannel faster than humans.
- Several participants prefer a composable pipeline (VAD → ASR → LLM → TTS) for:
  - Easier training and debugging.
  - Swapping models for cost/quality.
  - Integrating large remote LLMs, tools, RAG, and agent frameworks.
- Some propose hybrid architectures: PersonaPlex as the fast “mouth,” with a separate, smarter LLM + tools acting as the “brain,” coordinated by an orchestrator.
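The hybrid "mouth/brain" idea above can be sketched as a small orchestrator that emits a fast acknowledgement immediately while a slower model computes the real answer in the background. This is an illustrative sketch, not PersonaPlex's actual API: `mouth_reply`, `brain_reply`, and `orchestrate` are hypothetical stand-ins for a streaming speech model, a larger tool-using LLM, and the coordinating layer.

```python
import queue
import threading

def mouth_reply(text: str) -> str:
    """Hypothetical fast 'mouth': a shallow backchannel emitted with low latency."""
    return f"Mm-hm, let me check on '{text}'..."

def brain_reply(text: str) -> str:
    """Hypothetical slow 'brain': would call a remote LLM with tools/RAG."""
    return f"Considered answer to: {text}"

def orchestrate(utterance: str, out: queue.Queue) -> None:
    # Emit the fast backchannel immediately so the conversation feels live...
    out.put(("mouth", mouth_reply(utterance)))
    # ...while the smarter model computes in the background.
    t = threading.Thread(target=lambda: out.put(("brain", brain_reply(utterance))))
    t.start()
    t.join()

out: queue.Queue = queue.Queue()
orchestrate("weather in Oslo", out)
first = out.get()   # ("mouth", ...) arrives first
second = out.get()  # ("brain", ...) arrives once computed
```

In a real system the brain's output would be fed back to the mouth for spoken delivery rather than returned directly; the point is only that the two run concurrently under one coordinator.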
Interactivity, tools, and limitations
- Some were initially disappointed to discover that the provided example only processes WAV files rather than supporting true live conversation.
- Others point out that a turn-based “voice assistant” demo exists and that streaming is either supported or planned.
- Multiple people stress that without a parallel text channel for structured output (JSON, function calls), voice agents are severely limited.
- Community forks already experiment with adding tool calling by running a separate LLM in parallel.
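The "parallel text channel" point can be made concrete with a sketch of how structured output might ride alongside speakable text. The `<tool>…</tool>` tag convention here is invented for illustration; no such format is specified by PersonaPlex.

```python
import json
import re

# Hypothetical convention: the model's text channel interleaves speakable
# words with tool calls wrapped in <tool>...</tool> tags containing JSON.
TOOL_RE = re.compile(r"<tool>(.*?)</tool>", re.DOTALL)

def split_channels(text_stream: str):
    """Separate speakable text from structured tool calls."""
    tool_calls = [json.loads(m) for m in TOOL_RE.findall(text_stream)]
    spoken = TOOL_RE.sub("", text_stream).strip()
    return spoken, tool_calls

spoken, calls = split_channels(
    'Let me check. <tool>{"name": "get_weather", "args": {"city": "Oslo"}}</tool> One moment.'
)
# spoken is safe to hand to TTS; calls go to a tool dispatcher.
```

Without some channel like this, a voice-only model can describe an action but never trigger it, which is the limitation the discussion highlights.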
Performance and hardware concerns
- Reports are mixed: some see sub‑second, human‑beating reaction times on strong GPUs; others see ~10s latency and irrelevant replies on a MacBook.
- Questions raised about feasibility on lower-end Apple Silicon (e.g., 8GB M1) when also running a second LLM.
Alternative models and tooling
- Extensive discussion of other STT/TTS stacks on macOS:
  - Parakeet v2/v3, Parakeet‑TDT CoreML variants, Whisper, WhisperKit, Qwen‑TTS, Kokoro, and tools like Handy, FluidAudio.
- Emphasis on NPU‑offloaded models for speed and on pipelines that combine fast local STT with remote LLMs for post‑processing.
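The local-STT-plus-remote-postprocessing pattern reduces to a two-stage pipeline: a fast on-device model produces a rough transcript, and a remote LLM later repairs casing, punctuation, and errors. Both stages below are hypothetical stubs standing in for, e.g., an NPU-offloaded Parakeet/WhisperKit model and a remote LLM call.

```python
def local_stt(audio_chunk: bytes) -> str:
    # Stand-in for a fast NPU-offloaded model: returns a rough,
    # lowercase, unpunctuated transcript with minimal latency.
    return "send the report to alice by friday"

def remote_postprocess(raw: str) -> str:
    # Stand-in for a remote LLM cleanup pass; here just naive
    # capitalization and terminal punctuation.
    return raw.capitalize() + "."

transcript = remote_postprocess(local_stt(b"\x00\x01"))
```

The design point is that the user sees the rough transcript instantly, and the slower remote pass can overwrite it asynchronously without blocking interaction.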
Safety and psychological risks
- A linked lawsuit about a voice chatbot allegedly encouraging suicide sparks concern about romantic/“companion” personas in long voice chats.
- Participants argue current safety culture is inadequate; rare but severe failures are not acceptable for mass‑market audio bots.
- Some call for:
  - Stripping personality from general assistants.
  - Better user education on how LLMs work (context, stochasticity, “document completion”).
  - Stronger guardrails on role‑play and mental‑health‑adjacent scenarios.
AI writing style and UX
- Several dislike that the blog post and diagrams appear AI‑generated, with characteristic phrasing and overuse of certain rhetorical patterns.
- Some find LLM-written tech posts easier to skim; others find them bloated and off‑putting, and wish authors would write or at least prompt for concision.
Use cases and creative ideas
- Ideas include spam‑call “honeypots” that waste scammers’ time with plausible nonsense, IAM/face‑swap demos, educational tools, and outbound call agents.
- Some note current PersonaPlex is prone to “death spirals” (talking to itself, stuttering), so it’s not production‑ready yet but promising directionally.