2024-07-03

Voice Isolator: Strip background noise for film, podcast, interview production

State of the Art in Speech-to-Text and Noisy Audio

Several users recommend Whisper (including MacWhisper and Buzz frontends) as strong, general-purpose STT, but note it may struggle when speech is barely above the noise floor.
Deepgram Nova-2 is reported as more accurate than Whisper in some users' testing; its free online demo is suggested as a quick way to compare.
Gemini 1.5 Pro with audio input is described as “far better than any transcription model” for complex, noisy, multilingual interviews, but its limited output length and tendency to repeat itself mean long audio has to be split into chunks.
Some argue “audio forensics” companies using specialized tools and human effort still represent the practical SOTA for extremely poor recordings.
One commenter suggests simply paying humans to transcribe difficult audio, raising the verification problem for AI transcripts.
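
The chunking workaround mentioned for long recordings can be sketched as follows. This is a minimal illustration, not anyone's actual pipeline: the function name `chunk_spans`, the 5-minute chunk length, and the 5-second overlap are all assumptions chosen for the example. It only computes time spans; cutting the audio and calling a model on each span is left out.

```python
def chunk_spans(total_seconds, chunk_seconds=300.0, overlap_seconds=5.0):
    """Return (start, end) spans in seconds that cover the whole recording.

    Consecutive spans overlap by `overlap_seconds` so that a word cut at a
    chunk boundary appears whole in at least one chunk. The default values
    are illustrative assumptions, not recommendations from the thread.
    """
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap_seconds must be in [0, chunk_seconds)")

    spans = []
    start = 0.0
    step = chunk_seconds - overlap_seconds
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break  # the last span reached the end of the recording
        start += step
    return spans


if __name__ == "__main__":
    # A 1-hour interview split into overlapping 5-minute chunks.
    for start, end in chunk_spans(3600):
        print(f"{start:7.1f}s -> {end:7.1f}s")
```

Each span can then be transcribed separately and the transcripts merged, with the overlap used to line up adjacent pieces.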

Noise Reduction vs ASR Performance

Traditional tools like Audacity noise reduction, Adobe Podcast “Enhance Speech,” Auphonic, ai|coustics, Nvidia Broadcast, Krisp, DeepFilterNet, and DAW/VST workflows are widely mentioned.
Reports on ElevenLabs’ Voice Isolator are mixed: some find it no better than tuned ffmpeg filters; others say it removes music but leaves speech garbled or even outputs silence.
A technical concern: denoising may introduce distortions unseen in ASR training data, sometimes making recognition worse than with noisy input.
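
One way to check that concern on your own material is to transcribe both the noisy and the denoised version of a short clip and score each against a human-made reference. The sketch below computes word error rate (WER) with a plain edit distance; the function name `word_error_rate` and the lowercase, whitespace-based tokenization are simplifying assumptions, and the example transcripts are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up transcripts, purely for illustration.
    reference = "we met at the station around noon"
    from_noisy = "we met at the station around new"
    from_denoised = "we met the nation around noon"
    print("noisy input:   ", word_error_rate(reference, from_noisy))
    print("denoised input:", word_error_rate(reference, from_denoised))
```

If the denoised transcript scores a higher WER than the noisy one, the cleanup step is hurting recognition for that recording and model.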

Pricing Model and “Characters” Confusion

Many criticize ElevenLabs’ “1000 characters per minute of audio” phrasing as opaque and off-putting.
Confusion centers on what “character” means when the task is audio cleanup, not TTS or STT.
Some interpret “characters” as a site-wide credit unit reused from text-based products; others compare it to game “premium currency” that obscures real cost and leads to overbuying.
Several call the service expensive, especially for multi-hour podcasts.
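
To make the “characters” unit concrete, the arithmetic behind the quoted rate can be sketched as below. Only the 1000-characters-per-minute figure comes from the thread; the linear pro-rating of partial minutes, the function names, and the `price_per_1000_characters` parameter are assumptions for illustration, since no actual price is given here.

```python
CHARACTERS_PER_MINUTE = 1000  # rate quoted in the thread


def characters_for_audio(minutes):
    """Characters consumed by `minutes` of audio, assuming linear pro-rating."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return int(round(minutes * CHARACTERS_PER_MINUTE))


def estimated_cost(minutes, price_per_1000_characters):
    """Cost estimate; `price_per_1000_characters` is a hypothetical input
    that depends on the plan and must be supplied by the reader."""
    return characters_for_audio(minutes) / 1000 * price_per_1000_characters


if __name__ == "__main__":
    podcast_minutes = 2 * 60  # a two-hour episode
    print(characters_for_audio(podcast_minutes), "characters")
```

Under these assumptions a two-hour episode consumes 120,000 characters, which is why commenters found the unit hard to budget for without converting it back to minutes.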

Cloud-Only, Privacy, and Voice Cloning Concerns

Users dislike that ElevenLabs’ tools are cloud-only and wish for a Topaz-like, fully local desktop solution.
There is worry about uploading personal voice samples to “random” sites; people predict hearing their cloned voices in ads or content.
ElevenLabs’ licensed use of deceased celebrities’ voices prompts ethical unease, even if legal via estates.

Open Source and Local Alternatives

Open source voice tech (e.g., GPT-SoVITS, StyleTTS2, RVCv2) is seen as lagging far behind ElevenLabs for TTS and voice conversion.
Some point to free or one-time-purchase tools (Ultimate Vocal Remover, Supertone, Virtual DJ stems, DeepFilterNet) as viable local options for isolation/cleanup.
There is explicit demand for local, open solutions and for STT that includes speaker diarization, which is noted as still lacking.

Social and Legal Side Effects

Improved isolation undermines the existing tactic of blasting copyrighted music so that unwanted recordings get demonetized or blocked (e.g., some police responses to “First Amendment auditors”).
Debate emerges over whether these auditors are valuable civil-rights watchdogs or harassing nuisances, and whether using copyrighted music as a “countermeasure” is ethical or even legal.
