ElevenLabs TTS
What is ElevenLabs TTS?
ElevenLabs TTS turns text into natural-sounding speech, with four model variants on the same endpoint. Send the text and a voice ID and the call works; everything else is optional. The four models trade latency, language coverage, and character cap: Flash v2.5 returns first audio in roughly 75 ms across 32 languages, Turbo v2.5 sits at about 250 ms with stronger quality, Multilingual v2 covers 29 languages at the highest quality with a 10,000-character cap, and V3 reaches 70+ languages with the most dramatic delivery. Output formats span compressed MP3 and Opus, full-rate WAV, raw PCM, and 8 kHz μ-law and a-law for telephony. That coverage is what makes the same endpoint usable for both consumer-facing playback and IVR pipelines.
Key features of ElevenLabs TTS
Five features cover the surface area you'll touch in a real production pipeline.

Four model variants for the latency-quality-language trade
Flash v2.5 returns first audio in roughly 75 ms across 32 languages, which fits live agents and IVR. Turbo v2.5 sits at about 250 ms with stronger quality. Multilingual v2 is the default for long-form narration. V3 covers 70+ languages with the most dramatic delivery.

Eighteen output formats, including telephony
Output comes as MP3 or Opus at five bitrates each, raw PCM at five sample rates, 44.1 kHz WAV, and 8 kHz ulaw_8000 and alaw_8000. The two telephony formats drop straight onto Twilio and other voice circuits without an intermediate transcoder.

Voice settings you can actually tune
Five controls are exposed: stability, similarity boost, style, speed, and a speaker boost toggle. Each has a sensible default, but the controls are precise enough that you can shape the same voice for an audiobook one week and a high-energy promo the next without re-cloning.

Seeded, reproducible output
Pass an integer seed and the same text, voice, model variant, and settings return identical audio. That makes regression tests possible, lets games keep cached lines deterministic, and means that after a copy edit only the lines whose text changed need re-rendering; untouched lines come back the same.

Live voice catalog including your clones
Cloned voices created inside ElevenLabs show up automatically through the resources catalog, with no redeployment, no SDK update, and no separate fetch path. Pass any returned ID as the voice parameter and the call routes correctly.
Best for
Real-time voice agents and IVR
Flash v2.5 keeps end-to-end latency low and the ulaw_8000 / alaw_8000 outputs drop straight onto Twilio and other 8 kHz voice circuits.
Audiobooks and long-form narration
Multilingual v2 plus a locked seed produces a consistent voice across chapters; the 10,000-character cap fits a typical chapter in a single call.
Localized release work
Multilingual v2 covers 29 languages, V3 reaches 70+. Mix variants per locale to balance quality and language coverage in one pipeline.
Game and app character lines
Pass the same seed and voice settings per character and lines stay consistent across retakes during patch cycles.
Assistive playback for UI strings
Lock voice and settings per locale and TTS becomes a deterministic component of an accessible product.
Trailer narration and explainer voiceover
V3 with style turned up adds the dramatic delivery promo content needs without manual ADR.
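For the real-time voice agent case above, here is a minimal sketch of how downloaded ulaw_8000 audio might be framed for a Twilio Media Streams WebSocket. It assumes the audio is raw, headerless 8 kHz μ-law bytes (this page does not confirm the container), and that stream_sid is the identifier Twilio sent when the stream started. The function name and the 20 ms frame size are illustrative choices, not part of either API.

```python
import base64
import json
from typing import Iterator

# 20 ms of 8 kHz, 8-bit mu-law audio: 8000 samples/s * 0.020 s = 160 bytes.
FRAME_BYTES = 160


def twilio_media_messages(ulaw_audio: bytes, stream_sid: str) -> Iterator[str]:
    """Yield JSON "media" messages, each carrying one base64-encoded frame.

    Assumption: ulaw_audio is headerless mu-law at 8 kHz.
    """
    for start in range(0, len(ulaw_audio), FRAME_BYTES):
        frame = ulaw_audio[start:start + FRAME_BYTES]
        yield json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": base64.b64encode(frame).decode("ascii")},
        })
```

Each yielded string would be sent as one WebSocket text message. The last frame may be shorter than 160 bytes.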
Variants
Four model variants share one endpoint. Pick by the latency, quality, and language coverage you need.
Flash v2.5
First audio in ~75 ms, 32 languages, 40,000 characters per call. Use it for live voice agents, IVR, and any pipeline where end-to-end latency is the headline constraint.
Turbo v2.5
First audio in ~250 ms, 32 languages, 40,000 characters per call. Same broad language coverage as Flash with stronger quality. Good when you can tolerate roughly a quarter-second to first audio and want a richer voice on the other end.
Multilingual v2
29 languages, 10,000 characters per call. The default for long-form narration. Pair it with a locked seed and you get a consistent voice across chapters, episodes, and series, with no drift between calls.
V3
70+ languages, 5,000 characters per call. The most expressive of the four, with the biggest dynamic range and the most dramatic delivery. Use it for trailer narration, promo VO, and any cut where the voice has to carry the clip.
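The trade-offs above can be captured in a small lookup table. The sketch below is one possible way to shortlist variants by text length and latency budget. The figures come from this page. Only eleven_flash_v2_5 appears in the page's own example; the other three model_id strings follow ElevenLabs' naming and are assumptions to check against the parameter docs. Variant and candidates are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Variant:
    name: str
    model_id: str                  # assumed identifier, except for Flash v2.5
    first_audio_ms: Optional[int]  # None where the page gives no figure
    languages: int                 # lower bound where the page says "70+"
    max_chars: int


VARIANTS = [
    Variant("Flash v2.5", "eleven_flash_v2_5", 75, 32, 40_000),
    Variant("Turbo v2.5", "eleven_turbo_v2_5", 250, 32, 40_000),
    Variant("Multilingual v2", "eleven_multilingual_v2", None, 29, 10_000),
    Variant("V3", "eleven_v3", None, 70, 5_000),
]


def candidates(text_len: int,
               max_first_audio_ms: Optional[int] = None) -> List[Variant]:
    """Return variants whose cap fits the text and, if given, the latency budget."""
    result = []
    for variant in VARIANTS:
        if text_len > variant.max_chars:
            continue
        if max_first_audio_ms is not None:
            # A variant with no published latency cannot be shown to meet a budget.
            if (variant.first_audio_ms is None
                    or variant.first_audio_ms > max_first_audio_ms):
                continue
        result.append(variant)
    return result
```

For example, candidates(100, max_first_audio_ms=100) leaves only Flash v2.5, while candidates(20_000) leaves Flash v2.5 and Turbo v2.5 because the other two caps are lower.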
Use cases
Wire a real-time voice agent to Twilio in a day: call the TTS endpoint with output_format set to ulaw_8000, hand the buffer to your media stream, and Flash v2.5's first-audio latency keeps conversation natural. Produce a localized audiobook by writing one chapter generator that loops over locales and swaps between Multilingual v2 (29 languages) and V3 (70+) per market. Build deterministic in-game dialogue by binding each NPC to a fixed voice plus a fixed seed and settings profile. Layer accessible UI playback by pre-rendering common strings server-side and caching the resulting audio URLs.
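The in-game dialogue case could be sketched as a per-character profile that pins voice, seed, and settings, plus a cache key so unchanged lines are never re-rendered. This is an illustration under assumptions: the seed and voice_settings keys inside input are guesses at the request shape, and CharacterVoice is a hypothetical helper. The seed bound is the range this page gives in its FAQ.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Dict

MAX_SEED = 4_294_967_295  # upper end of the seed range stated on this page


@dataclass(frozen=True)
class CharacterVoice:
    voice: str
    seed: int
    model_id: str = "eleven_flash_v2_5"
    settings: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not 0 <= self.seed <= MAX_SEED:
            raise ValueError(f"seed must be in 0..{MAX_SEED}")

    def build_input(self, text: str) -> Dict[str, Any]:
        """Assemble the "input" object for one line of dialogue."""
        payload: Dict[str, Any] = {
            "text": text,
            "voice": self.voice,
            "model_id": self.model_id,
            "seed": self.seed,  # assumed parameter name
        }
        if self.settings:
            payload["voice_settings"] = dict(self.settings)  # hypothetical key
        return payload

    def cache_key(self, text: str) -> str:
        """Identical text, voice, model, settings, and seed give the same key."""
        canonical = json.dumps(self.build_input(text), sort_keys=True,
                               ensure_ascii=False)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Storing rendered audio under cache_key(text) means a patch that edits one line changes one key, and every other line keeps resolving to its existing file.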
API examples
Call ElevenLabs TTS from any language by POSTing to /v1/tasks. Full parameter docs live at docs.unifically.com/models/audio/elevenlabs/text-to-speech.
curl -X POST https://api.unifically.com/v1/tasks \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "elevenlabs/text-to-speech",
    "input": {
      "text": "Hello world, this is a test.",
      "voice": "TX3LPaxmHKxFdv7VOQHJ",
      "model_id": "eleven_flash_v2_5",
      "output_format": "mp3_44100_128"
    }
  }'
A successful submission returns a task_id. Poll GET /v1/tasks/<task_id> or set a callback_url on the request to receive the finished audio URL.
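The same submit-and-poll flow could be sketched in Python with only the standard library. The function names are hypothetical, and placing callback_url at the top level of the body is an assumption. Because this page does not describe the response layout beyond task_id, the sketch builds the two requests and leaves interpreting the task status to the caller.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

API_BASE = "https://api.unifically.com/v1"


def build_task_request(api_key: str, text: str, voice: str,
                       model_id: str = "eleven_flash_v2_5",
                       output_format: str = "mp3_44100_128",
                       callback_url: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) the POST /v1/tasks request from the curl example."""
    body: Dict[str, Any] = {
        "model": "elevenlabs/text-to-speech",
        "input": {
            "text": text,
            "voice": voice,
            "model_id": model_id,
            "output_format": output_format,
        },
    }
    if callback_url is not None:
        body["callback_url"] = callback_url  # top-level placement is an assumption
    return urllib.request.Request(
        f"{API_BASE}/tasks",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )


def build_status_request(api_key: str, task_id: str) -> urllib.request.Request:
    """Build the GET /v1/tasks/<task_id> polling request."""
    return urllib.request.Request(
        f"{API_BASE}/tasks/{task_id}",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def send_json(request: urllib.request.Request, timeout: float = 30.0) -> Dict[str, Any]:
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A caller would pass the result of build_task_request to send_json, read task_id from the response, then call send_json on build_status_request at an interval until the task reports completion.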
FAQs
ElevenLabs TTS is a text-to-speech endpoint that turns any string into spoken audio. Send the text and a voice ID, optionally pick the model variant and output format, optionally tune the voice settings, and Unifically returns a finished audio URL after the task completes.
There are four model variants: Flash v2.5 (~75 ms first audio, 32 languages, 40,000-character limit), Turbo v2.5 (~250 ms, 32 languages, 40,000 characters), Multilingual v2 (29 languages, 10,000 characters, the default for long-form work), and V3 (70+ languages, 5,000 characters, the most expressive).
Eighteen output formats are supported: MP3 at 32, 64, 96, 128, and 192 kbps; Opus at 32, 64, 96, 128, and 192 kbps; raw PCM at 16, 22.05, 24, 44.1, and 48 kHz; WAV at 44.1 kHz; and telephony-grade ulaw_8000 and alaw_8000 for 8 kHz voice circuits. The default is mp3_44100_128.
The character limit depends on the model. Flash v2.5 and Turbo v2.5 accept up to 40,000 characters per call, Multilingual v2 caps at 10,000, and V3 caps at 5,000. Past these limits, split the text into multiple calls.
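Splitting text that exceeds a model's cap might look like the sketch below. It packs whole sentences into chunks no longer than max_chars and hard-cuts only when a single sentence is itself over the cap. split_for_tts is a hypothetical helper, and the sentence boundary rule (whitespace after ., !, or ?) is a simplification that will mis-split abbreviations.

```python
import re
from typing import List


def split_for_tts(text: str, max_chars: int) -> List[str]:
    """Split text into chunks of at most max_chars, preferring sentence boundaries.

    Sentences inside a chunk are rejoined with a single space.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Last resort: a single sentence longer than the cap is cut at the cap.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

With V3, for instance, a long script would be passed through split_for_tts(script, 5_000) and each chunk submitted as its own task.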
Output is reproducible when you pass an integer seed in the range 0 to 4,294,967,295: the same text, voice, model, voice settings, and seed return the same audio. That is useful for regression tests and deterministic playback in games and assistive tech.
Five voice settings are tunable. Stability trades emotional range for consistency, so higher values sound steadier but more monotone. Similarity boost controls how closely the output matches the source voice. Style amplifies the speaker's character at the cost of latency. Speed scales the speaking pace relative to normal. use_speaker_boost adds extra similarity at a small latency cost.
Cloned voices are supported. Voice cloning is set up inside ElevenLabs, and the resulting voice ID becomes selectable here. Pull your voice list from the resources catalog; any returned ID, including clones, drops in as the voice parameter.