Realtime TTS

Streaming text-to-speech with sub-second first-audio latency. Built for voice agents and live applications.

We don't have TTS in your language yet. Contribute your voices and help us. Sell your voice

Text

Streaming
0/5,000 characters ~0.3s first audio

Voice & Settings

Streaming-capable models only.

Live Latency

Click Stream to measure first-audio latency

Output

Audio chunks will play here as they stream in.


How Streaming TTS Works

1. Send Text

POST text to /v1/tts/stream/. The response comes back as a Server-Sent Events (SSE) stream.

2. Model Generates

Kokoro splits the text into chunks and generates the audio for each chunk on the GPU.

3. Stream Chunks

Base64-encoded WAV chunks arrive over SSE and start playing immediately.

4. Listen Live

User hears the start of the sentence in under a second, even on long inputs.
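
The four steps above can be sketched from the client side. This is a minimal sketch, not an official client: it assumes each SSE event carries one base64-encoded WAV chunk directly in its data field, which this page does not specify (the real payload may be wrapped in JSON). The helper names iter_sse_data and iter_wav_chunks are hypothetical.

```python
import base64
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each Server-Sent Event in a stream of text lines.

    Only the "data" field is handled; "event", "id" and "retry" are ignored.
    """
    buf = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buf:
                yield "\n".join(buf)
                buf = []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):
                value = value[1:]  # strip the single optional leading space
            buf.append(value)
    # An event that is not followed by a blank line is discarded at end of stream.


def iter_wav_chunks(lines: Iterable[str]) -> Iterator[bytes]:
    """Decode each event payload as base64 (ASSUMPTION: payload is raw base64 WAV)."""
    for payload in iter_sse_data(lines):
        yield base64.b64decode(payload)
```

Each yielded bytes object can be handed to an audio player as soon as it arrives, which is what lets playback start before the full text has been synthesized.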

Use Cases

Where sub-second latency unlocks new experiences.

Voice Agents

Conversational bots that respond as fast as a human would.

Live Dubbing

Translate and dub a stream in real time without buffering pauses.

Games

NPC dialog that reacts to player choices instantly, no pre-rendered VO.

Accessibility

Screen readers and assistive tools that start speaking the moment a user clicks.

Realtime TTS Plans

Start free, upgrade when you need more

Free
  • Kokoro streaming (free model)
  • 500 characters per generation
  • 10 free streams/day per anonymous user
  • Sub-second first-audio latency
  • SSE streaming over HTTPS
Most Popular
Free Account
  • 15,000 characters at signup
  • 5,000 chars per stream
  • API key for programmatic access
  • Generation history
  • No daily stream cap
Sign Up Free
Pro
  • MOSS-TTS-Realtime (when live)
  • 100,000 chars per stream
  • Priority GPU queue
  • Voice agent + Twilio integration
  • Higher rate limits
Upgrade

Frequently Asked Questions

Realtime text-to-speech streams audio chunks as they are generated, instead of waiting for the entire sentence to complete. The first audio sample arrives in under one second, making it suitable for live voice agents, dubbing, and interactive applications where latency matters.

Regular TTS generates the full audio file before returning anything — you wait, then hear the entire sentence at once. Realtime TTS uses Server-Sent Events (SSE) to stream short audio chunks as the model produces them. The user hears the start of the sentence almost immediately, even on long inputs.

Kokoro is the default backend — it generates audio roughly 100x faster than real time on a modern GPU. We are integrating MOSS-TTS-Realtime as a higher-quality alternative; users will be able to choose per request once that ships.

Typical first-audio latency on Kokoro is 300-800 ms over a public connection. Once the first chunk has been generated, network round-trip time is the dominant factor. The page shows the measured time-to-first-audio for each request in the UI, so you can see exactly how long it took.
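
To take the same measurement in your own client, something like the following could work. It is a minimal sketch: measure_stream is a hypothetical helper, and it assumes you pass it a lazy iterator that only issues the request when iteration begins, so that the clock starts before any network traffic.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def measure_stream(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, Optional[float]]:
    """Consume a chunk iterator and report first-chunk latency and totals."""
    start = clock()
    first_chunk_s = None
    total_chunks = 0
    for _ in chunks:
        if first_chunk_s is None:
            # Time from starting iteration to receiving the first audio chunk.
            first_chunk_s = clock() - start
        total_chunks += 1
    return {
        "first_chunk_s": first_chunk_s,
        "total_chunks": total_chunks,
        "total_s": clock() - start,
    }
```

The clock parameter exists only so the function can be exercised with a fake clock. In normal use the default time.perf_counter is fine.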

Voice agents that respond conversationally, live dubbing for streaming media, interactive game NPCs, accessibility readers that start speaking the moment a user clicks, and any application where waiting two or three seconds for audio would feel sluggish.

Yes, streaming is available through the API. POST to https://api.tts.ai/v1/tts/stream/ with the same body as the regular /v1/tts/ endpoint. The response is an SSE stream of base64-encoded WAV chunks. The free tier supports 10 generations per day per anonymous user; authenticated users get the full per-account character allowance.
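
A request to that endpoint might be built like this. Treat it as a sketch under stated assumptions: only the URL comes from this page. The JSON field names (text, model, voice), the "kokoro" model identifier, and the Bearer authorization scheme are all assumptions, so check the API reference for the real body and auth format.

```python
import json
import urllib.request
from typing import Iterator, Optional

STREAM_URL = "https://api.tts.ai/v1/tts/stream/"


def build_stream_request(
    text: str,
    api_key: Optional[str] = None,
    model: str = "kokoro",          # ASSUMPTION: model identifier
    voice: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the streaming endpoint."""
    body = {"text": text, "model": model}  # ASSUMPTION: field names
    if voice is not None:
        body["voice"] = voice
    headers = {
        "Content-Type": "application/json",
        "Accept": "text/event-stream",
    }
    if api_key:
        # ASSUMPTION: bearer-token auth; anonymous requests omit the header.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        STREAM_URL,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def open_stream_lines(req: urllib.request.Request, timeout: float = 30.0) -> Iterator[str]:
    """Send the request and yield the SSE response line by line as text."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        for raw in resp:
            yield raw.decode("utf-8")
```

Because open_stream_lines is a generator, the request is only sent when you start iterating over it, and the lines it yields can be fed to any SSE parser.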

Kokoro uses pre-trained voices and does not clone. MOSS-TTS-Realtime (when integrated) supports zero-shot voice cloning from a 3-second reference. For full voice cloning today, use the regular /text-to-speech/ page with Chatterbox or GPT-SoVITS — those are not streaming-capable but produce custom voices.

Same character cost as the regular TTS endpoint. Kokoro is free-tier (1x cost). MOSS-TTS-Realtime will run at the standard tier (2x cost) when enabled. The streaming protocol does not add any pricing surcharge.
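
As a worked example of those multipliers, the arithmetic could look like this. The sketch assumes "1x" means one allowance character is charged per input character and "2x" means two, which this page does not spell out. The model keys are hypothetical identifiers.

```python
# ASSUMPTION: multipliers are applied per input character.
COST_MULTIPLIER = {
    "kokoro": 1,             # free tier, 1x
    "moss-tts-realtime": 2,  # standard tier, 2x (when enabled)
}


def stream_cost(text: str, model: str) -> int:
    """Characters deducted from the allowance for one stream."""
    return len(text) * COST_MULTIPLIER[model]


def streams_covered(allowance: int, chars_per_stream: int, model: str) -> int:
    """How many streams of a given length an allowance covers."""
    return allowance // (chars_per_stream * COST_MULTIPLIER[model])
```

Under that reading, the 15,000 signup characters would cover three full 5,000-character Kokoro streams, and a 5,000-character stream on MOSS-TTS-Realtime would deduct 10,000 characters.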

Yes, it works with phone calls: pair the streaming endpoint with a Twilio voice webhook to feed live audio into a call. Our voice agent platform already does this for IVR and outbound calling. End-to-end latency on a phone call is typically 1-2 seconds including STT and LLM response.

If your network drops a chunk in transit, the streaming player will skip ahead rather than stall. For applications that cannot tolerate gaps, fall back to the regular non-streaming endpoint, or buffer 500ms of audio before starting playback.
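
The buffering option can be sketched as a small pre-buffer that holds chunks back until at least 500 ms of audio is queued. This assumes every chunk is a complete WAV file with a valid header, so its duration can be read with the standard wave module. If the service sends one header followed by raw audio data, the duration would have to be computed from byte counts instead. The function names are hypothetical.

```python
import io
import wave
from typing import Iterable, Iterator, List


def wav_duration_s(chunk: bytes) -> float:
    """Duration of one self-contained WAV chunk, in seconds."""
    with wave.open(io.BytesIO(chunk), "rb") as w:
        return w.getnframes() / w.getframerate()


def prebuffer(chunks: Iterable[bytes], min_buffer_s: float = 0.5) -> Iterator[bytes]:
    """Hold chunks until min_buffer_s of audio is queued, then pass everything through."""
    held: List[bytes] = []
    buffered_s = 0.0
    started = False
    for chunk in chunks:
        if started:
            yield chunk
            continue
        held.append(chunk)
        buffered_s += wav_duration_s(chunk)
        if buffered_s >= min_buffer_s:
            # Threshold reached: release the queued audio and stop holding back.
            started = True
            yield from held
            held = []
    # Stream ended before the threshold was reached: play what we have.
    yield from held
```

The trade-off is that the pre-buffer adds up to min_buffer_s of delay before the first audio plays, in exchange for headroom if a later chunk arrives late.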

Stream Speech in Real Time

Free for the first 10 generations a day. Sign up to unlock the full character allowance and API access.