Voice Isolator: Strip background noise for film, podcast, interview production

State of the Art in Speech-to-Text and Noisy Audio

  • Several users recommend Whisper (including the MacWhisper and Buzz frontends) as a strong, general-purpose STT option, but note it may struggle when speech is barely above the noise floor.
  • Deepgram Nova 2 is reported as more accurate than Whisper in some testing; a free online demo is suggested.
  • Gemini 1.5 Pro with audio input is described as “far better than any transcription model” for complex, noisy, multilingual interviews, but limits on output length and a tendency to repeat itself mean the audio has to be split into chunks.
  • Some argue “audio forensics” companies using specialized tools and human effort still represent the practical SOTA for extremely poor recordings.
  • One commenter suggests simply paying humans to transcribe difficult audio, which highlights the problem of verifying AI transcripts.
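
The chunking workaround mentioned for long recordings can be sketched roughly as follows. This is only an illustration of splitting a recording into overlapping time spans before transcription; the function name, the 10-minute chunk length, and the 5-second overlap are assumptions, not values reported in the thread.

```python
def chunk_spans(total_seconds, chunk_seconds=600.0, overlap_seconds=5.0):
    """Return (start, end) spans in seconds that cover a recording.

    Hypothetical helper: the chunk length and overlap are placeholder
    values. The overlap is there so that a word cut at one boundary
    appears whole in the neighbouring chunk.
    """
    if chunk_seconds <= overlap_seconds:
        raise ValueError("chunk_seconds must be larger than overlap_seconds")

    spans = []
    start = 0.0
    step = chunk_seconds - overlap_seconds
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break  # the last span already reaches the end of the audio
        start += step
    return spans


if __name__ == "__main__":
    # A 25-minute interview split into chunks of at most 10 minutes.
    for start, end in chunk_spans(25 * 60):
        print(f"{start:7.1f}s -> {end:7.1f}s")
```

Each span could then be cut out with any audio tool and transcribed separately; merging the overlapping transcripts afterwards is a separate problem this sketch does not address.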

Noise Reduction vs ASR Performance

  • Traditional tools like Audacity noise reduction, Adobe Podcast “Enhance Speech,” Auphonic, ai|coustics, Nvidia Broadcast, Krisp, DeepFilterNet, and DAW/VST workflows are widely mentioned.
  • Reports on ElevenLabs’ Voice Isolator are mixed: some find it no better than tuned ffmpeg filters; others say it removes music but leaves speech garbled or even outputs silence.
  • A technical concern: denoising may introduce distortions unseen in ASR training data, sometimes making recognition worse than with noisy input.
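
Whether denoising helps or hurts recognition on a given recording can be checked rather than assumed. A minimal sketch, assuming a trusted human reference transcript for a short excerpt is available: compute the word error rate (WER) of the ASR output on the noisy input and on the denoised input, then compare. The function name and the sample strings below are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words.

    Simplified sketch: lowercases and splits on whitespace only, with no
    punctuation or number normalisation.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # Standard Levenshtein distance over words, kept to two rows.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up example strings, not real ASR output.
    reference = "we met at the station around nine"
    from_noisy = "we met at the station around nine"
    from_denoised = "we met at station round nine"
    print("noisy input WER:   ", word_error_rate(reference, from_noisy))
    print("denoised input WER:", word_error_rate(reference, from_denoised))
```

If the denoised WER comes out higher on a representative excerpt, feeding the original noisy audio to the recognizer is likely the better choice for that recording.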

Pricing Model and “Characters” Confusion

  • Many criticize ElevenLabs’ “1000 characters per minute of audio” phrasing as opaque and off-putting.
  • Confusion centers on what “character” means when the task is audio cleanup, not TTS or STT.
  • Some interpret “characters” as a site-wide credit unit reused from text-based products; others compare it to game “premium currency” that obscures real cost and leads to overbuying.
  • Several call the service expensive, especially for multi-hour podcasts.
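
The arithmetic behind the complaint is straightforward once “characters” are treated as credits. A rough sketch, using the quoted rate of 1000 characters per minute of audio; the price per 1000 characters is a placeholder parameter, not an actual ElevenLabs price, and the function names are hypothetical.

```python
CHARACTERS_PER_MINUTE = 1000  # rate quoted in the discussion


def characters_for_audio(minutes):
    """Credits ("characters") consumed by a recording of the given length."""
    return minutes * CHARACTERS_PER_MINUTE


def estimated_cost(minutes, price_per_1000_characters):
    """Cost estimate under an assumed price per 1000 characters.

    The price is supplied by the caller; it depends on the plan and is
    not stated in the thread.
    """
    return characters_for_audio(minutes) / 1000 * price_per_1000_characters


if __name__ == "__main__":
    podcast_minutes = 120  # a two-hour episode
    placeholder_price = 0.30  # hypothetical price per 1000 characters
    print(characters_for_audio(podcast_minutes), "characters")
    print(round(estimated_cost(podcast_minutes, placeholder_price), 2), "currency units")
```

At the quoted rate, a two-hour episode uses 120,000 characters, which is why multi-hour podcasts add up quickly whatever the per-character price is.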

Cloud-Only, Privacy, and Voice Cloning Concerns

  • Users dislike that ElevenLabs’ tools are cloud-only and wish for a Topaz-like, fully local desktop solution.
  • There is worry about uploading personal voice samples to “random” sites; people predict hearing their cloned voices in ads or content.
  • ElevenLabs’ licensed use of deceased celebrities’ voices prompts ethical unease, even if legal via estates.

Open Source and Local Alternatives

  • Open source voice tech (e.g., GPT-SoVITS, StyleTTS 2, RVC v2) is seen as lagging far behind ElevenLabs for TTS/voice conversion.
  • Some point to free or one-time-purchase tools (Ultimate Vocal Remover, Supertone, Virtual DJ stems, DeepFilterNet) as viable local options for isolation/cleanup.
  • There is explicit demand for local, open solutions and for STT that includes speaker diarization, which is noted as still lacking.

Social and Legal Side Effects

  • Improved isolation undermines a previous tactic of blasting copyrighted music so that unwanted recordings get demonetized or blocked (e.g., in encounters between “First Amendment auditors” and police).
  • Debate emerges over whether these auditors are valuable civil-rights watchdogs or harassing nuisances, and whether using copyrighted music as a “countermeasure” is ethical or even legal.