Voice Isolator: Strip background noise for film, podcast, interview production
State of the Art in Speech-to-Text and Noisy Audio
- Several users recommend Whisper (including MacWhisper and Buzz frontends) as strong, general-purpose STT, but note it may struggle when speech is barely above the noise floor.
- Deepgram Nova 2 is reported as more accurate than Whisper in some testing; a free online demo is suggested.
- Gemini 1.5 Pro with audio input is described as “far better than any transcription model” for complex, noisy, multilingual interviews, but its output-length limits and tendency to repeat itself mean long audio has to be split into chunks.
- Some argue “audio forensics” companies using specialized tools and human effort still represent the practical SOTA for extremely poor recordings.
- One commenter suggests simply paying humans to transcribe difficult audio, which raises the question of how an AI transcript of the same audio could be verified at all.
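The chunking workaround mentioned for long recordings can be sketched as a small helper that computes overlapping time spans to transcribe one at a time. This is only an illustration: the function name, the 10-minute chunk length, and the 5-second overlap are assumptions, not values given in the thread.

```python
def chunk_spans(total_seconds, chunk_seconds=600.0, overlap_seconds=5.0):
    """Return (start, end) spans in seconds that cover the whole recording.

    Consecutive spans overlap by `overlap_seconds` so that a word cut at a
    chunk boundary appears whole in at least one chunk. The default sizes
    are illustrative assumptions, not recommendations from the thread.
    """
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap_seconds must be in [0, chunk_seconds)")

    spans = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break
        # Step back by the overlap so the next chunk re-covers the boundary.
        start = end - overlap_seconds
    return spans


if __name__ == "__main__":
    # A one-hour interview split into roughly 10-minute chunks.
    for start, end in chunk_spans(3600):
        print(f"{start:7.1f} s to {end:7.1f} s")
```

Each span would then be cut from the source file and sent to the model separately; merging the overlapping transcripts is left out of this sketch.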
Noise Reduction vs ASR Performance
- Existing tools such as Audacity noise reduction, Adobe Podcast “Enhance Speech,” Auphonic, ai|coustics, Nvidia Broadcast, Krisp, DeepFilterNet, and DAW/VST workflows are widely mentioned.
- Reports on ElevenLabs’ Voice Isolator are mixed: some find it no better than tuned ffmpeg filters; others say it removes music but leaves speech garbled or even outputs silence.
- A technical concern: denoising may introduce distortions unseen in ASR training data, sometimes making recognition worse than with noisy input.
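For a sense of what "tuned ffmpeg filters" might look like, here is a minimal sketch that assembles an ffmpeg command chaining a high-pass filter, a low-pass filter, and the FFT denoiser `afftdn`. The thread does not say which filters or settings commenters used, so the filter choice, the cutoff frequencies, and the noise-floor value are all assumptions, as are the helper name and file names.

```python
def build_denoise_command(src, dst, highpass_hz=80, lowpass_hz=8000,
                          noise_floor_db=-25):
    """Build an ffmpeg argument list for a basic speech cleanup chain.

    Assumed chain (not taken from the thread):
      highpass -> removes rumble below `highpass_hz`
      lowpass  -> removes hiss above `lowpass_hz`
      afftdn   -> FFT-based denoiser; `nf` is its noise floor in dB
    """
    filters = ",".join([
        f"highpass=f={highpass_hz}",
        f"lowpass=f={lowpass_hz}",
        f"afftdn=nf={noise_floor_db}",
    ])
    return ["ffmpeg", "-i", src, "-af", filters, dst]


if __name__ == "__main__":
    cmd = build_denoise_command("interview.wav", "interview_clean.wav")
    print(" ".join(cmd))
    # To actually run it (requires ffmpeg on PATH):
    #   import subprocess; subprocess.run(cmd, check=True)
```

Given the concern about denoising artifacts, it may be worth transcribing both the original and the filtered file and comparing the results rather than assuming the cleaned audio is better for ASR.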
Pricing Model and “Characters” Confusion
- Many criticize ElevenLabs’ “1000 characters per minute of audio” phrasing as opaque and off-putting.
- Confusion centers on what “character” means when the task is audio cleanup, not TTS or STT.
- Some interpret “characters” as a site-wide credit unit reused from text-based products; others compare it to game “premium currency” that obscures real cost and leads to overbuying.
- Several call the service expensive, especially for multi-hour podcasts.
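To make the "1000 characters per minute of audio" rule concrete, the arithmetic can be sketched as below. The only figure taken from the thread is the 1000-characters-per-minute rate; rounding partial minutes up to the next whole character and the price per 1000 characters are assumptions for illustration, and the price in particular is a placeholder rather than a real ElevenLabs rate.

```python
def characters_for_audio(duration_seconds, chars_per_minute=1000):
    """Characters charged for a clip, prorated by the second.

    Assumes proportional billing rounded up to a whole character; the
    thread does not say how partial minutes are actually billed.
    """
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-duration_seconds * chars_per_minute // 60)


def estimated_cost(characters, usd_per_1000_chars):
    """Convert characters to dollars at a caller-supplied (hypothetical) rate."""
    return characters / 1000 * usd_per_1000_chars


if __name__ == "__main__":
    three_hours = 3 * 60 * 60  # a multi-hour podcast, in seconds
    chars = characters_for_audio(three_hours)
    print(f"3 h of audio = {chars:,} characters")
    # 0.50 is a placeholder rate, not a published price.
    print(f"at a placeholder $0.50 per 1000 characters: ${estimated_cost(chars, 0.50):.2f}")
```

Expressed this way, one hour of audio consumes 60,000 characters, which shows why commenters found the unit hard to map onto real cost without first knowing the per-character price of their plan.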
Cloud-Only, Privacy, and Voice Cloning Concerns
- Users dislike that ElevenLabs’ tools are cloud-only and wish for a Topaz-like, fully local desktop solution.
- There is worry about uploading personal voice samples to “random” sites; people predict hearing their cloned voices in ads or content.
- ElevenLabs’ licensed use of deceased celebrities’ voices prompts ethical unease, even if legal via estates.
Open Source and Local Alternatives
- Open source voice tech (e.g., GPT-SoVITS, StyleTTS2, RVCv2) is seen as lagging far behind ElevenLabs for TTS/voice conversion.
- Some point to free or one-time-purchase tools (Ultimate Vocal Remover, Supertone, Virtual DJ stems, DeepFilterNet) as viable local options for isolation/cleanup.
- There is explicit demand for local, open solutions and for STT that includes speaker diarization, which is noted as still lacking.
Social and Legal Side Effects
- Improved isolation undermines an existing tactic of blasting copyrighted music so that unwanted recordings get demonetized or blocked (e.g., some police responses to being filmed by “First Amendment auditors”).
- Debate emerges over whether these auditors are valuable civil-rights watchdogs or harassing nuisances, and whether using copyrighted music as a “countermeasure” is ethical or even legal.