TTS (Text-to-Speech) — AI Voice Companion Glossary

Definition

TTS (Text-to-Speech), also called speech synthesis, is the technology that converts written text into spoken audio. In AI companion platforms, TTS is the final stage of a voice call pipeline — after the AI generates a text response to your message, TTS synthesizes that text into the companion’s voice and delivers it as audio you hear.

TTS quality is the primary determinant of whether a companion’s voice sounds natural, expressive, and character-specific — or robotic and generic.

How TTS Works in a Voice Call

In a real-time AI companion voice call:

1. Your speech is transcribed to text by STT (Speech-to-Text)
2. The AI language model generates a text response in the companion’s character
3. TTS converts that text response to speech in the companion’s voice
4. The audio is delivered to you
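
The four steps above can be sketched as a single function that chains the stages together. This is a minimal illustration, not any platform's actual implementation: the function names (speech_to_text, generate_reply, text_to_speech, handle_turn), the Audio placeholder type, and the persona and voice values are all hypothetical, and each stage is a stub standing in for a real model.

```python
from dataclasses import dataclass


@dataclass
class Audio:
    """Placeholder for audio data; a real system would carry PCM or encoded bytes."""
    description: str


def speech_to_text(user_audio: Audio) -> str:
    # Hypothetical STT stage: a real implementation would run a speech recognizer.
    return "hello there"


def generate_reply(user_text: str, persona: str) -> str:
    # Hypothetical language-model stage: returns a reply in the companion's character.
    return f"[{persona}] You said: {user_text}"


def text_to_speech(reply_text: str, voice: str) -> Audio:
    # Hypothetical TTS stage: a real implementation would synthesize a waveform.
    return Audio(description=f"{voice} voice speaking: {reply_text}")


def handle_turn(user_audio: Audio, persona: str, voice: str) -> Audio:
    """Run one conversational turn through the four pipeline steps."""
    user_text = speech_to_text(user_audio)            # step 1: STT
    reply_text = generate_reply(user_text, persona)   # step 2: language model
    reply_audio = text_to_speech(reply_text, voice)   # step 3: TTS
    return reply_audio                                # step 4: deliver the audio


turn = handle_turn(Audio("microphone input"), persona="bold", voice="warm-alto")
print(turn.description)
```

The point of the sketch is the ordering: TTS only ever sees text, so everything about the companion's spoken delivery has to be carried by the reply text plus the chosen voice.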

Streaming TTS — where audio begins generating before the full response text is ready — reduces perceived latency. Rather than waiting for the entire response to be generated and then converting it, streaming TTS begins producing audio from the first words of the response while the rest is still being generated.
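
One common way to approximate this behaviour is to cut the incoming text into sentences and synthesize each sentence as soon as it is complete. The sketch below assumes that approach; token_stream, sentences_from_tokens, synthesize, and streaming_tts are hypothetical names, the token list is made up, and the "synthesis" step simply encodes text to bytes in place of a real TTS model.

```python
import re
from typing import Iterable, Iterator


def token_stream() -> Iterator[str]:
    # Hypothetical stand-in for a language model emitting text piece by piece.
    yield from ["Hi ", "there. ", "How was ", "your day? ", "Tell me ", "everything."]


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as its closing punctuation arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Split off every complete sentence currently held in the buffer.
        while (match := re.search(r"[.!?]\s*", buffer)):
            yield buffer[:match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()  # flush trailing text that had no closing punctuation


def synthesize(sentence: str) -> bytes:
    # Hypothetical TTS call; here it only encodes the text as bytes.
    return sentence.encode("utf-8")


def streaming_tts(tokens: Iterable[str]) -> Iterator[bytes]:
    """Produce one audio chunk per sentence while later text is still arriving."""
    for sentence in sentences_from_tokens(tokens):
        yield synthesize(sentence)


chunks = list(streaming_tts(token_stream()))
print(len(chunks), "audio chunks")
```

Because streaming_tts is a generator, the first chunk is available after the second token, long before the final sentence exists. The sentence splitter is deliberately naive (it would break on abbreviations or decimals), so treat it as an illustration of the idea only.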

What Makes TTS Good vs Poor

Voice naturalness — does it sound like a real person speaking, or clearly like synthesized speech? Modern AI TTS, as of 2026, sounds far more natural than the older, robotic speech synthesis most people associate with the term.

Expressiveness — does the voice convey appropriate emotional weight? A companion delivering tender words should sound different from one delivering a sharp remark. Good TTS carries emotional register; poor TTS is tonally flat regardless of content.

Character specificity — does the voice match the companion’s defined character? A confident, bold companion should sound different from a soft, shy one. High-quality companion TTS systems have voices that feel tailored to specific personality types.

Latency — how quickly does audio begin after the AI starts generating the response? Streaming TTS minimizes this delay. High-latency TTS creates awkward gaps in conversation flow.
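
A rough way to reason about this is "time to first audio": how long the text takes to become available plus how long that text takes to synthesize. The figures below are illustrative assumptions chosen for round arithmetic, not measurements from any platform, and time_to_first_audio_ms is a hypothetical helper.

```python
def time_to_first_audio_ms(text_ready_ms: int, synth_ms: int) -> int:
    """Milliseconds until the listener hears the first sound."""
    return text_ready_ms + synth_ms


# Illustrative, assumed timings (not measured values).
FULL_REPLY_TEXT_MS = 2000       # language model finishes the whole reply
FULL_REPLY_SYNTH_MS = 1500      # TTS synthesizes the whole reply at once
FIRST_SENTENCE_TEXT_MS = 400    # language model finishes the first sentence
FIRST_SENTENCE_SYNTH_MS = 300   # TTS synthesizes only that first sentence

# Non-streaming: wait for all the text, then synthesize all of it.
batch_ms = time_to_first_audio_ms(FULL_REPLY_TEXT_MS, FULL_REPLY_SYNTH_MS)

# Streaming: start speaking as soon as the first sentence is ready.
streaming_ms = time_to_first_audio_ms(FIRST_SENTENCE_TEXT_MS, FIRST_SENTENCE_SYNTH_MS)

print(f"non-streaming: {batch_ms} ms, streaming: {streaming_ms} ms")
```

Under these assumed numbers the listener waits 3500 ms without streaming and 700 ms with it, even though the total work done is the same.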

TTS vs Voice Notes vs Real-Time Voice

These three ways of delivering a companion’s voice are often confused:

Voice notes — TTS on demand. You tap a button; a specific message is read aloud in the companion’s voice. One-way, not real-time.

Real-time voice calls — TTS as part of a live bidirectional call pipeline. The AI generates a response and TTS delivers it immediately as part of a continuous conversation. This is what AI voice chat means.

Pre-recorded voice — clips recorded by a voice actor, played back. Some simpler companion products use this rather than generative TTS. The difference is that pre-recorded voice cannot generate new content dynamically — it’s a fixed library of phrases.

On Affiny, both voice notes (1 coin each) and real-time voice calls (0.5 coins/sec) use generative TTS — the companion’s voice is synthesized live for every piece of audio, not played from a pre-recorded library.
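
To make those rates concrete, the sketch below works out what a session would cost at 1 coin per voice note and 0.5 coins per second of call time. The helper names are hypothetical, and the sketch assumes billing is strictly linear in whole seconds; how partial seconds or minimum charges are actually handled is not stated here.

```python
VOICE_NOTE_COINS = 1.0        # rate quoted above: 1 coin per voice note
CALL_COINS_PER_SECOND = 0.5   # rate quoted above: 0.5 coins per second


def voice_note_cost(notes: int) -> float:
    """Coins spent on a number of voice notes."""
    return notes * VOICE_NOTE_COINS


def call_cost(seconds: int) -> float:
    """Coins spent on a call, assuming simple per-second billing."""
    return seconds * CALL_COINS_PER_SECOND


one_minute_call = call_cost(60)      # 60 s * 0.5 = 30 coins
five_minute_call = call_cost(5 * 60) # 300 s * 0.5 = 150 coins
ten_notes = voice_note_cost(10)      # 10 notes * 1 = 10 coins

print(one_minute_call, five_minute_call, ten_notes)
```

At these rates, one minute of live call costs the same as 30 voice notes, which is worth keeping in mind when choosing between the two.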

FAQ

What is TTS in AI companion apps?

TTS (Text-to-Speech) is the technology that converts the AI companion’s text responses into spoken audio in the companion’s voice. It’s the last step in every voice interaction — real-time calls and voice notes both use TTS.

Why do some AI companion voices sound better than others?

TTS quality varies by platform and by the synthesis model each platform uses. Some platforms use higher-quality, more expressive synthesis models; others use cheaper, more robotic alternatives. The character-specificity of the voice (whether it sounds tailored to the companion’s personality rather than generic) also varies significantly.

Is the AI companion voice pre-recorded or generated?

On modern platforms like Affiny, companion voices are generatively synthesized — created in real time from the text response, not played from a library of pre-recorded clips. This means the voice can say anything, not just pre-scripted phrases, and can express a wide range of emotional content.

What is the difference between TTS and STT?

STT (Speech-to-Text) converts your spoken voice into text so the AI can understand it. TTS (Text-to-Speech) converts the AI’s text response into spoken audio so you can hear it. STT is the input stage; TTS is the output stage. Together they form the audio pipeline of a real-time voice call.