How voice AI works (talk → text → AI → speech)
When you talk to an AI, three different models are running in sequence. Here's what each one does and why latency matters more than you think.
Voice AI feels like one thing: you talk, it talks back. Under the hood it's actually three separate models chained together, and each one adds latency. Knowing the chain helps you understand why some voice apps feel snappy and others feel laggy.
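One way to picture the chain is as three functions composed end to end. This is just a sketch of the data flow, with hypothetical type and function names, not any real library's API:

```ts
// Hypothetical types sketching the pipeline's data flow.
type STT = (audio: Blob) => Promise<string>;          // speech → transcript
type LLM = (prompt: string) => AsyncIterable<string>; // transcript → streamed reply
type TTS = (text: string) => Promise<Blob>;           // reply text → audio

// Each stage waits on the previous one, so their latencies add up.
async function voiceTurn(stt: STT, llm: LLM, tts: TTS, mic: Blob): Promise<Blob[]> {
  const transcript = await stt(mic);
  const audioChunks: Blob[] = [];
  for await (const piece of llm(transcript)) {
    // Streaming lets TTS start before the full reply exists
    // (in practice you'd buffer tokens into sentences first).
    audioChunks.push(await tts(piece));
  }
  return audioChunks;
}
```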
Step 1: Speech to text (STT)
Your microphone records audio. A speech recognition model (Whisper is the popular open one) converts that audio into a text transcript. This usually takes 200ms to 1s depending on how long you spoke and which model is doing it.
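Here's a minimal sketch of the STT step, assuming you're calling OpenAI's hosted Whisper endpoint through the official Node SDK; the file path is a placeholder:

```ts
import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Send the recorded audio to Whisper and get back a plain-text transcript.
const transcription = await openai.audio.transcriptions.create({
  file: fs.createReadStream("recording.webm"), // placeholder path
  model: "whisper-1",
});

console.log(transcription.text);
```

Running Whisper locally works the same way conceptually: audio in, transcript out; only the latency numbers change.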
Step 2: The LLM thinks
The transcript gets sent to a language model, the same kind that powers a normal text chat. It generates a reply, usually streaming word by word so the next step can start before it's finished.
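That streaming detail is what keeps the whole pipeline feeling responsive. A sketch using the same SDK (the model name is illustrative, and `sendToTTS` is a hypothetical hand-off to step 3):

```ts
import OpenAI from "openai";

const openai = new OpenAI();

const stream = await openai.chat.completions.create({
  model: "gpt-4o-mini", // illustrative model name
  messages: [{ role: "user", content: "What's the weather like on Mars?" }],
  stream: true, // tokens arrive as they're generated
});

let buffer = "";
for await (const chunk of stream) {
  buffer += chunk.choices[0]?.delta?.content ?? "";
  // At each sentence boundary, flush the buffer to TTS
  // so speech can start before the reply is complete.
  if (/[.!?]\s*$/.test(buffer)) {
    // sendToTTS(buffer); // hypothetical hand-off to step 3
    buffer = "";
  }
}
```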
Step 3: Text to speech (TTS)
A speech synthesis model takes the text reply and turns it into audio. Modern TTS models sound very natural: they handle pauses, emphasis, and emotion, not just word-by-word reading.
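And the final hop, sketched against OpenAI's speech endpoint (the model and voice names are just examples); any TTS service with a synthesize call slots in the same way:

```ts
import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();

// Turn the reply text into audio. The model infers pacing and
// emphasis from punctuation, so the input is just plain text.
const speech = await openai.audio.speech.create({
  model: "tts-1",  // example model name
  voice: "alloy",  // example voice
  input: "Sure, here's what I found about the weather on Mars.",
});

fs.writeFileSync("reply.mp3", Buffer.from(await speech.arrayBuffer()));
```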
What sansxel does differently
- Two voice modes: Dictate (talk → AI types) and Talk (full hands-free conversation).
- VAD (voice activity detection): the mic figures out when you're done talking, so there's no push-to-talk button.
- Live preview transcript: the Web Speech API runs in parallel so you see your words as you speak, while Whisper still produces the canonical version (see the sketch after this list).
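The live preview relies on the browser's built-in Web Speech API, which emits interim (not-yet-final) results as you speak. A browser-side sketch, assuming a Chromium-based browser where the constructor is still `webkit`-prefixed; rendering is left as a stub:

```ts
// Grab the browser's recognizer (prefixed in Chromium-based browsers).
const SpeechRecognition =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;

const recognition = new SpeechRecognition();
recognition.continuous = true;     // keep listening across pauses
recognition.interimResults = true; // emit partial results as you speak

recognition.onresult = (event: any) => {
  let preview = "";
  for (let i = event.resultIndex; i < event.results.length; i++) {
    preview += event.results[i][0].transcript;
  }
  // Render `preview` in the UI; Whisper's transcript replaces it later.
  console.log(preview);
};

recognition.start();
```

Because the browser recognizer is fast but rougher, it's a good fit for the preview, while the slower, more accurate Whisper pass supplies the text that actually gets used.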