Research · R&D · 2024

Doppler

A real-time neural voice-conversion pipeline that started as a VRChat joke and turned into a firsthand study of synthetic-voice fraud. Speech goes in, a different voice comes out — fast enough to feel live. The whole project became one problem: collapsing end-to-end latency until the illusion held.

01

Overview

summary

Doppler is a real-time voice-conversion pipeline. You speak into a microphone and a completely different voice comes out the other side, with little enough delay that it reads as a live person rather than a recording.

It's built from two off-the-shelf services wired into a tight loop: OpenAI Whisper for speech-to-text and ElevenLabs for voice synthesis, with the synthesized audio routed back in as a virtual microphone so it can drive a live application like a game voice chat. The interesting part was never connecting the APIs — it was the engineering needed to make the round-trip fast enough to feel natural.

02

It Started as a Joke

origin

A friend and I were in VRChat and wanted to actually sound like the avatars we were playing — to have a cartoon character talk back to people in its own voice. The first version was crude: I wired up an ElevenLabs voice model and typed the lines by hand.

It fell apart instantly. Typing was far too slow, and the conversations felt scripted and dead — nothing organic, nothing that could pass for a real person reacting in the moment. If the goal was to convince someone they were talking to the character, a chat box behind the curtain was never going to work. So I took myself out of the typing loop entirely.

03

The Pipeline

architecture

I added Whisper to transcribe my own speech as I talked, fed that text straight into the ElevenLabs voice model, and routed the synthesized audio back into the game as my microphone input. Now I could just speak and come out the other side as someone else. It worked — and the voice itself was convincing.
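A minimal sketch of that loop, not Doppler's actual code. The class name `VoicePipeline` and the `fake_*` helpers are hypothetical stand-ins: in a real build, `transcribe` would wrap a speech-to-text call, `synthesize` a voice-synthesis call, and `play` a write to a virtual microphone device. The stubs exist only so the control flow can run without network access or API keys.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class VoicePipeline:
    """Mic chunk -> text -> synthesized audio -> playback, one chunk at a time."""
    transcribe: Callable[[bytes], str]   # assumed wrapper around an STT service
    synthesize: Callable[[str], bytes]   # assumed wrapper around a TTS service
    play: Callable[[bytes], None]        # assumed writer to a virtual mic

    def run(self, mic_chunks: Iterable[bytes]) -> None:
        for chunk in mic_chunks:
            text = self.transcribe(chunk)
            if not text.strip():
                continue  # nothing was said in this chunk, so skip synthesis
            self.play(self.synthesize(text))


# Hypothetical stubs: they only mimic the shape of each stage.
def fake_transcribe(chunk: bytes) -> str:
    return chunk.decode("utf-8")


def fake_synthesize(text: str) -> bytes:
    return text.upper().encode("utf-8")


def demo() -> List[bytes]:
    played: List[bytes] = []
    pipeline = VoicePipeline(fake_transcribe, fake_synthesize, played.append)
    pipeline.run([b"hello", b"   ", b"world"])
    return played
```

Passing the three stages in as callables keeps the loop testable and makes it easy to swap a stage out while measuring where the time goes.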

Real-time conversion path

My voice in, a cloned voice out — output re-injected as mic input
[Diagram: live mic (my voice, chunked) → Whisper (speech → text) → ElevenLabs (text → cloned voice) → game audio out (VRChat mic). Every stage adds latency, the enemy of "live".]
Synthesized audio is re-injected as a virtual microphone, so the cloned voice drives the live voice chat in place of my own.

04

The Real Work — Killing the Delay

latency engineering

The first working version lagged. The round-trip was slow enough that there was a beat of dead air before every reply, and people could tell something was off. A delayed voice doesn't read as a person; it reads as a recording. From there the entire project became a single question: how do you get a cloned voice back fast enough to feel live instead of delayed?

That's where the engineering actually happened: not in the wiring, but in fighting every source of latency in the chain. The tuning meant streaming audio in chunks instead of waiting for whole utterances, trimming the transcription step, tightening the hand-off into synthesis, and shaving rendering time until the reply landed inside the window where conversation still feels natural. Each fraction of a second I cut made it less of a walkie-talkie and more of a voice.
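Why chunked streaming helps can be shown with a toy timing model. This is a sketch under stated assumptions, not a measurement of Doppler: the function names are hypothetical, every stage is assumed to cost a fixed number of milliseconds per chunk, and the batched variant is assumed to wait for the whole utterance and then pay each stage's cost for all chunks in sequence.

```python
from typing import List, Sequence


def streamed_finish_times(n_chunks: int, chunk_ms: int,
                          stage_costs_ms: Sequence[int]) -> List[int]:
    """When each chunk leaves the last stage, in ms from the start of speech.

    Stages run as a pipeline: a stage can start on chunk i as soon as the
    chunk has arrived from the previous stage and the stage itself is free.
    """
    stage_free = [0] * len(stage_costs_ms)
    finished = []
    for i in range(n_chunks):
        t = (i + 1) * chunk_ms  # chunk i is fully captured at this time
        for s, cost in enumerate(stage_costs_ms):
            t = max(t, stage_free[s]) + cost
            stage_free[s] = t
        finished.append(t)
    return finished


def batched_finish_time(n_chunks: int, chunk_ms: int,
                        stage_costs_ms: Sequence[int]) -> int:
    """When output is ready if every stage waits for the whole utterance."""
    t = n_chunks * chunk_ms  # capture the full utterance first
    for cost in stage_costs_ms:
        t += n_chunks * cost  # each stage then processes all of it
    return t


# Illustrative numbers only: a 5 s utterance in 500 ms chunks, with made-up
# per-chunk costs of 200 ms (transcription) and 300 ms (synthesis).
first_streamed_ms = streamed_finish_times(10, 500, [200, 300])[0]
batched_ms = batched_finish_time(10, 500, [200, 300])
```

In this model the first streamed audio is ready shortly after the first chunk is captured, while the batched reply cannot begin until the entire utterance has passed through every stage. If any stage costs more per chunk than the chunk's duration, the streamed pipeline backs up and the delay grows with each chunk.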

Collapsing the round-trip

End-to-end latency — batched vs. streamed (relative)
[Chart: relative round-trip latency, low to high, against a natural-conversation threshold. First build (full-utterance, batched): delayed. After tuning (streamed, chunked): live.]
Illustrative of the latency reduction, not measured timings. Measured round-trip figures can drop straight into this panel.

05

Where It Stopped Being Funny

threat surface

Once the delay dropped and the voice became hard to tell from a live speaker in real time, the joke curdled into something else. I was no longer holding a party trick; I was holding a working impersonation tool. The same thing that let me pass as a cartoon character would let someone pass as your son, your boss, or your bank: live, on a phone call, with no recording to give it away.

This isn't hypothetical. The FTC warns that a scammer needs only a short clip of someone's voice — often scraped from content posted online — plus a voice-cloning program, and the cloned voice makes a fraudulent request far more believable. Doppler is a small, honest demonstration of how low that barrier has fallen: built from off-the-shelf APIs by one person, over a few evenings.

Seconds

of audio scraped from social media is enough to clone a voice, per the FTC

2023

FTC issued its first consumer alert on AI voice-cloning scams; the threat is already operational

Live

real-time synthesis means a phone call, not a recording, and no tell-tale delay

Family-emergency scams

A “relative” in distress begging for money, in a voice the victim recognizes instantly. The FTC's flagship example of voice-cloning fraud.

Vishing & wire fraud

A cloned executive authorizing a transfer or approving a fake invoice over the phone — the FTC specifically warns scammers can clone a CEO's voice to fool employees.

Account takeover

Defeating voice-based identity checks and support-line verification by speaking in the account holder's own voice, in real time.

Extortion & reputation attacks

Putting words in a real person's mouth: fabricated confessions, threats, or admissions that sound undeniably like them.

06

Defending Against It

mitigation

I build and harden infrastructure for a living, and the most useful thing I can do with a threat is understand it from the attacker's side first. Doppler is offensive research in the honest sense: I built the capability, felt exactly where it gets dangerous, and came away with a clearer read on how to defend against it. The reassuring part is that synthetic voice can be detected and defended against, but only if people know to look.

Out-of-band verification

A family safe-word, or hanging up and calling back on a known number, defeats nearly every voice-only attack. The FTC's own advice is blunt: don't trust the voice.

Liveness & artifact detection

Synthesis leaves spectral fingerprints. The FTC's Voice Cloning Challenge recognized real-time detectors that score audio for “liveness” in two-second chunks, which is exactly the seam Doppler exposes.

Process over voices

Never let a voice alone authorize money or access. Trust the procedure — verification, callbacks, second approvers — not the sound coming out of the speaker.

Shrink the source material

The attack starts with public audio. Limiting voice clips on open social profiles raises the cost of building a convincing clone in the first place.
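The two-second liveness scoring mentioned under "Liveness & artifact detection" can be sketched as a windowing loop around a pluggable scorer. Everything named here is an assumption for illustration: `two_second_windows`, `flag_call`, the threshold, and especially `toy_score`, which is a placeholder and not a detector. A real system would put a trained model behind `score_window`.

```python
from typing import Callable, List, Sequence, Tuple


def two_second_windows(samples: Sequence[float], sample_rate: int,
                       window_s: float = 2.0) -> List[Sequence[float]]:
    """Split audio into consecutive full windows; a trailing partial is dropped."""
    size = int(sample_rate * window_s)
    if size <= 0:
        raise ValueError("window must contain at least one sample")
    return [samples[i:i + size]
            for i in range(0, len(samples) - size + 1, size)]


def flag_call(samples: Sequence[float], sample_rate: int,
              score_window: Callable[[Sequence[float]], float],
              threshold: float = 0.5,
              min_flagged: int = 2) -> Tuple[List[float], bool]:
    """Score every window for liveness (1.0 = live) and flag the call if
    at least `min_flagged` windows fall below `threshold`."""
    scores = [score_window(w) for w in two_second_windows(samples, sample_rate)]
    flagged = sum(1 for s in scores if s < threshold)
    return scores, flagged >= min_flagged


# Placeholder scorer (hypothetical): it only makes the example runnable.
def toy_score(window: Sequence[float]) -> float:
    return 1.0 if max(window) > 0.5 else 0.0


# Tiny illustrative input: 4 samples per second, so each window is 8 samples.
demo_audio = [0.9] * 8 + [0.1] * 16
demo_scores, demo_flagged = flag_call(demo_audio, 4, toy_score)
```

Requiring more than one low-scoring window before flagging is one way to avoid reacting to a single noisy chunk; the right threshold and count would have to come from evaluating an actual detector.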

07

Sources

receipts

The fraud case here isn't speculative. It's drawn from the federal consumer-protection guidance cited above, the FTC's warnings on how voice cloning is already being used.
