sansxel
The AI workshop for makers


beginner · 5 min read

How voice AI works (talk → text → AI → speech)

When you talk to an AI, three different models are running in sequence. Here's what each one does and why latency matters more than you think.

By Sansxel (OWNER) · Apr 25, 2026

Voice AI feels like one thing: you talk, it talks back. Under the hood it's actually three separate models chained together, and each one adds latency. Knowing the chain helps you understand why some voice apps feel snappy and others feel laggy.

🎙️ → 📝 → 🧠 → 🔊
Speech → Text (STT) → LLM → Speech (TTS)
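The chain above can be sketched as three function calls in sequence. Everything below is a stub — the function names, return values, and example text are illustrative, not sansxel's actual pipeline:

```python
# A minimal sketch of a voice turn: STT -> LLM -> TTS.
# Each stub stands in for a real model call.

def speech_to_text(audio: bytes) -> str:
    # Stub: a real app would run a speech recognition model (e.g. Whisper) here.
    return "what's the weather like"

def llm_reply(transcript: str) -> str:
    # Stub: a real app would call a chat model here.
    return f"You asked: {transcript}. Here's my answer."

def text_to_speech(text: str) -> bytes:
    # Stub: a real app would synthesize audio here; we just return the bytes of the text.
    return text.encode("utf-8")

def voice_turn(audio: bytes) -> bytes:
    transcript = speech_to_text(audio)  # step 1: STT
    reply = llm_reply(transcript)       # step 2: LLM
    return text_to_speech(reply)        # step 3: TTS

audio_out = voice_turn(b"...mic samples...")
```

The latency you feel is the sum of all three calls, which is why each step below is worth understanding on its own.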

Step 1: Speech to text (STT)

Your microphone records audio. A speech recognition model (Whisper is the popular open one) converts that audio into a text transcript. This usually takes 200ms to 1s depending on how long you spoke and which model is doing it.

Step 2: The LLM thinks

The transcript gets sent to a language model, the same kind that powers a normal text chat. It generates a reply, usually streaming word by word so the next step can start before it's finished.
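Streaming is what makes the rest of the pipeline overlap. As a sketch, here's a stubbed LLM that yields tokens one at a time instead of returning the whole reply at once (real APIs expose this as a streaming response; the reply text here is made up):

```python
# Stub of a streaming LLM: yields the reply token by token,
# so downstream code can consume each token the moment it exists.

def llm_stream(transcript: str):
    reply = "Sunny today. Highs near twenty. Bring sunglasses."
    for token in reply.split(" "):
        yield token + " "  # a consumer sees this before the reply is complete

# A consumer can act on partial output immediately:
tokens = list(llm_stream("what's the weather like"))
partial = "".join(tokens[:2])  # already usable while the rest streams in
```

With a blocking call you'd wait for the full reply before doing anything; with a generator like this, TTS can start on the first sentence while the model is still writing the last one.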

Step 3: Text to speech (TTS)

A speech synthesis model takes the text reply and turns it into audio. Modern TTS models sound very natural: they handle pauses, emphasis, and emotion, not just word-by-word reading.

The trick to feeling fast
Good voice apps don't wait for the LLM to finish writing before starting TTS. They stream sentences as they come out, TTS each one immediately, and play them back-to-back. That alone can cut perceived latency roughly in half.
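Here's a toy version of that trick: buffer incoming tokens, cut at sentence boundaries, and hand each complete sentence to TTS right away. `synthesize` is a stand-in for a real TTS call, and the boundary rule (`.`, `!`, `?` followed by a space) is deliberately naive:

```python
import re

def synthesize(sentence: str) -> bytes:
    # Stub for a real TTS call; returns the sentence's bytes.
    return sentence.encode("utf-8")

def stream_tts(token_stream):
    """Yield synthesized audio per sentence, without waiting for the full reply."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Cut whenever the buffer contains a finished sentence.
        while (m := re.search(r"[.!?]\s+", buffer)):
            sentence = buffer[: m.end()].strip()
            buffer = buffer[m.end():]
            yield synthesize(sentence)  # playback can start here
    if buffer.strip():
        yield synthesize(buffer.strip())  # flush whatever is left

tokens = (t + " " for t in "Sunny today. Highs near twenty. Bring sunglasses.".split())
chunks = list(stream_tts(tokens))  # three audio chunks, ready one sentence at a time
```

The first chunk is ready as soon as the first sentence ends, which is why streamed pipelines feel so much faster than ones that synthesize the whole reply at once.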

What sansxel does differently

  • Two voice modes: Dictate (talk → AI types) and Talk (full hands-free conversation).
  • VAD (voice activity detection): the mic figures out when you're done, no push-to-talk button.
  • Live preview transcript: Web Speech API runs in parallel so you see your words as you speak, while Whisper still produces the canonical version.
Hit Voice in the workshop →
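To make the VAD idea concrete, here's a toy energy-based version: treat mic input as chunks of amplitude samples and call the turn done after enough consecutive quiet chunks. The threshold and chunk count are made-up values, and real VADs (including whatever sansxel ships) are model-based rather than a fixed cutoff:

```python
QUIET = 0.02   # RMS below this counts as silence (illustrative value)
END_AFTER = 3  # this many quiet chunks in a row ends the turn

def rms(chunk):
    # Root-mean-square amplitude: a crude measure of loudness.
    return (sum(x * x for x in chunk) / len(chunk)) ** 0.5

def end_of_speech(chunks):
    """Return the index of the chunk where the turn ends, or None if still talking."""
    quiet_run = 0
    for i, chunk in enumerate(chunks):
        quiet_run = quiet_run + 1 if rms(chunk) < QUIET else 0
        if quiet_run >= END_AFTER:
            return i
    return None

speech = [[0.3, -0.2]] * 5 + [[0.001, -0.001]] * 3  # 5 loud chunks, then silence
done_at = end_of_speech(speech)
```

Requiring a run of quiet chunks, rather than a single one, is what keeps a mid-sentence pause from cutting you off.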
Write for sansxel

Want your work in the Learn library? Apply for a hardlocked byline.

Apply to write

Keep learning

beginner · 4 min
What is AI, really?

AI like ChatGPT works by predicting the next word, kind of like autocomplete, but way smarter. Here's the plain-English version of what's happening under the hood.

AI
beginner · 10 min
Build your first AI app in 10 minutes

A real working chat app, end to end. JavaScript, no framework, ~50 lines. You'll get a feel for how requests, streaming, and prompts fit together.

Build
beginner · 6 min
Sansxel REST API, quickstart

Authenticate, send a chat request, stream a reply. Three steps. Copy-paste examples in JavaScript and Python.

APIs