ElevenLabs TTS API

    ElevenLabs TTS

    What is ElevenLabs TTS?

    ElevenLabs TTS turns text into natural-sounding speech, with four model variants on the same endpoint. Text and a voice ID are the only required inputs; everything else is optional. The four models trade off latency, language coverage, and character cap: Flash v2.5 returns first audio in roughly 75 ms across 32 languages, Turbo v2.5 sits at about 250 ms with stronger quality, Multilingual v2 covers 29 languages at the highest quality with a 10,000-character cap, and V3 reaches 70+ languages with the most dramatic delivery. Output formats span compressed MP3 and Opus, full-rate WAV, raw PCM, and 8 kHz μ-law and A-law for telephony. That range is what makes the same endpoint usable for both consumer-facing playback and IVR pipelines.

    Key features of ElevenLabs TTS

    Five features cover the surface area you'll touch in a real production pipeline.

    Four model variants for the latency-quality-language trade-off

    Flash v2.5 returns first audio in roughly 75 ms across 32 languages, which fits live agents and IVR. Turbo v2.5 sits at about 250 ms with stronger quality. Multilingual v2 is the default for long-form narration. V3 covers 70+ languages with the most dramatic delivery.

    Eighteen output formats, including telephony

    MP3 and Opus across five bitrates each, raw PCM at five sample rates, full-rate WAV, plus 8 kHz ulaw_8000 and alaw_8000 that drop straight onto Twilio and other voice circuits without an intermediate transcoder.

    Voice settings you can actually tune

    Stability, similarity boost, style, speed, and a speaker boost toggle. Each has a sensible default, but the controls are precise enough that you can shape the same voice for an audiobook one week and a high-energy promo the next without re-cloning.

    Seeded, reproducible output

    Pass an integer seed and the same text, voice, model variant, and settings return identical audio. That makes regression tests possible, lets games keep cached lines deterministic, and means lines you did not touch re-render identically after a copy edit elsewhere in the script.
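
    Because identical inputs return identical audio, a cache can be keyed on exactly those inputs. The following is a minimal sketch rather than part of the Unifically API: `render_cache_key` is a hypothetical helper, and the setting names inside `settings` are placeholders for whatever your requests actually send.

```python
import hashlib
import json


def render_cache_key(text, voice, model_id, settings, seed):
    """Return a stable hex digest for one fully specified render request.

    Hypothetical helper: the five inputs mirror the ones the page says
    determine the audio (text, voice, model variant, settings, seed).
    """
    # sort_keys makes the digest independent of dict insertion order.
    canonical = json.dumps(
        {
            "text": text,
            "voice": voice,
            "model_id": model_id,
            "settings": settings,
            "seed": seed,
        },
        sort_keys=True,
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Placeholder setting names and values, for illustration only.
key = render_cache_key(
    "Welcome back.",
    "TX3LPaxmHKxFdv7VOQHJ",
    "eleven_flash_v2_5",
    {"stability": 0.5, "speed": 1.0},
    42,
)
```

    A changed word, setting, or seed produces a different key, so only the edited lines miss the cache and need a new render.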

    Live voice catalog including your clones

    Cloned voices created inside ElevenLabs show up automatically through the resources catalog, with no redeployment, no SDK update, and no separate fetch path. Pass any returned ID as the voice parameter and the call routes correctly.

    Best for

    Real-time voice agents and IVR

    Flash v2.5 keeps end-to-end latency low and the ulaw_8000 / alaw_8000 outputs drop straight onto Twilio and other 8 kHz voice circuits.

    Audiobooks and long-form narration

    Multilingual v2 plus a locked seed produces a consistent voice across chapters; the 10,000-character cap fits a short chapter in a single call, and longer chapters can be split across calls.

    Localized release work

    Multilingual v2 covers 29 languages, V3 reaches 70+. Mix variants per locale to balance quality and language coverage in one pipeline.

    Game and app character lines

    Pass the same seed and voice settings per character and lines stay consistent across retakes during patch cycles.

    Assistive playback for UI strings

    Lock voice and settings per locale and TTS becomes a deterministic component of an accessible product.

    Trailer narration and explainer voiceover

    V3 with style turned up adds the dramatic delivery promo content needs without manual ADR.

    Variants

    Four model variants share one endpoint. Pick by the latency, quality, and language coverage you need.

    Flash v2.5

    First audio in ~75 ms, 32 languages, 40,000 characters per call. Use it for live voice agents, IVR, and any pipeline where end-to-end latency is the headline constraint.

    Turbo v2.5

    First audio in ~250 ms, 32 languages, 40,000 characters per call. Same broad language coverage as Flash with stronger quality. Good when you can accept roughly a quarter-second to first audio (about 175 ms more than Flash) and want a richer voice on the other end.

    Multilingual v2

    29 languages, 10,000 characters per call. The default for long-form narration. Pair it with a locked seed and you get consistent voice across chapters, episodes, and series, with no drift between calls.

    V3

    70+ languages, 5,000 characters per call. The most expressive of the four, with the biggest dynamic range and the most dramatic delivery. Use it for trailer narration, promo VO, and any cut where the voice has to carry the clip.
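
    The figures above can be kept in a small lookup table so a pipeline can filter variants by text length and latency budget. This is a sketch under stated assumptions: `Variant` and `variants_that_fit` are hypothetical names, latency is left as `None` where the page gives no figure, and "70+" is stored as its lower bound.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    name: str
    first_audio_ms: Optional[int]  # None where the page quotes no latency
    languages: int                 # lower bound when the page says "70+"
    max_chars: int                 # per-call character cap


# Numbers copied from the variant descriptions above.
VARIANTS = [
    Variant("Flash v2.5", 75, 32, 40_000),
    Variant("Turbo v2.5", 250, 32, 40_000),
    Variant("Multilingual v2", None, 29, 10_000),
    Variant("V3", None, 70, 5_000),
]


def variants_that_fit(char_count, max_first_audio_ms=None):
    """Return names of variants whose cap holds char_count.

    If a latency budget is given, variants without a quoted first-audio
    figure are excluded rather than guessed at.
    """
    fits = []
    for v in VARIANTS:
        if char_count > v.max_chars:
            continue
        if max_first_audio_ms is not None:
            if v.first_audio_ms is None or v.first_audio_ms > max_first_audio_ms:
                continue
        fits.append(v.name)
    return fits
```

    Quality and expressiveness are not encoded here; the table only rules out variants that cannot take the request at all.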

    Use cases

    Wire a real-time voice agent to Twilio in a day: call the TTS endpoint with output_format set to ulaw_8000, hand the buffer to your media stream, and Flash v2.5's first-audio latency keeps conversation natural. Produce a localized audiobook by writing one chapter generator that loops over locales and swaps between Multilingual v2 (29 languages) and V3 (70+) per market. Build deterministic in-game dialogue by binding each NPC to a fixed voice plus a fixed seed and settings profile. Layer accessible UI playback by pre-rendering common strings server-side and caching the resulting audio URLs.
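
    For the Twilio case, handing the buffer to a media stream usually means cutting it into short frames and base64-encoding each one. The sketch below assumes the downloaded ulaw_8000 audio is raw, headerless μ-law (worth verifying against a real response) and uses the JSON shape of Twilio Media Streams `media` messages; `media_messages` is a hypothetical helper, and 20 ms framing is a common choice rather than a requirement.

```python
import base64
import json

# 20 ms of 8 kHz, 8-bit mu-law: 8000 samples/s * 0.020 s * 1 byte/sample.
FRAME_BYTES = 160


def media_messages(ulaw_audio: bytes, stream_sid: str):
    """Yield one JSON 'media' message per 20 ms frame of mu-law audio.

    Assumes ulaw_audio contains raw samples with no file header.
    """
    for start in range(0, len(ulaw_audio), FRAME_BYTES):
        frame = ulaw_audio[start:start + FRAME_BYTES]
        yield json.dumps(
            {
                "event": "media",
                "streamSid": stream_sid,
                "media": {"payload": base64.b64encode(frame).decode("ascii")},
            }
        )
```

    Each yielded string would be sent as one WebSocket text message; the stream identifier comes from the `start` event Twilio sends when the stream opens.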

    API examples

    Call ElevenLabs TTS from any language by POSTing to /v1/tasks. Full parameter docs live at docs.unifically.com/models/audio/elevenlabs/text-to-speech.

    curl -X POST https://api.unifically.com/v1/tasks \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer YOUR_API_KEY" \
      -d '{
        "model": "elevenlabs/text-to-speech",
        "input": {
          "text": "Hello world, this is a test.",
          "voice": "TX3LPaxmHKxFdv7VOQHJ",
          "model_id": "eleven_flash_v2_5",
          "output_format": "mp3_44100_128"
        }
      }'
    

    A successful submission returns a task_id. Poll GET /v1/tasks/<task_id>, or set a callback_url on the request, to receive the finished audio URL.
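
    The same submit-then-poll flow might look like this in Python using only the standard library. Treat it as a sketch: the request body mirrors the curl example, but the page does not show the shape of the task response, so the completion check is passed in by the caller, and sending the bearer token on the GET is an assumption.

```python
import json
import time
import urllib.request

API_BASE = "https://api.unifically.com"


def build_submit_request(api_key, text, voice,
                         model_id="eleven_flash_v2_5",
                         output_format="mp3_44100_128"):
    """Build (but do not send) the POST /v1/tasks request from the curl example."""
    body = {
        "model": "elevenlabs/text-to-speech",
        "input": {
            "text": text,
            "voice": voice,
            "model_id": model_id,
            "output_format": output_format,
        },
    }
    return urllib.request.Request(
        f"{API_BASE}/v1/tasks",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def http_get_task(api_key, task_id):
    """Fetch GET /v1/tasks/<task_id> and return the decoded JSON body.

    ASSUMPTION: the GET takes the same bearer token as the POST.
    """
    req = urllib.request.Request(
        f"{API_BASE}/v1/tasks/{task_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def poll_task(fetch, is_done, task_id,
              interval_s=1.0, timeout_s=60.0, sleep=time.sleep):
    """Call fetch(task_id) until is_done(task) is true or the timeout passes.

    is_done is supplied by the caller because the response fields are not
    documented on this page.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        task = fetch(task_id)
        if is_done(task):
            return task
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} not finished after {timeout_s} s")
        sleep(interval_s)
```

    Sending the built request with `urllib.request.urlopen(...)` and reading `task_id` from the JSON reply completes the loop; a `callback_url` avoids polling altogether.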

    FAQs

    People also ask

    ElevenLabs TTS is a text-to-speech endpoint that turns any string into spoken audio. Send the text and a voice ID, optionally pick the model variant and output format, optionally tune the voice settings, and Unifically returns a finished audio URL after the task completes.

    There are four model variants: Flash v2.5 (~75 ms first audio, 32 languages, 40,000-character limit), Turbo v2.5 (~250 ms, 32 languages, 40,000 characters), Multilingual v2 (29 languages, 10,000 characters, the default for long-form work), and V3 (70+ languages, 5,000 characters, the most expressive).

    Eighteen output formats are supported: MP3 at 32, 64, 96, 128, and 192 kbps; Opus at 32, 64, 96, 128, and 192 kbps; raw PCM at 16, 22.05, 24, 44.1, and 48 kHz; WAV at 44.1 kHz; and telephony-grade ulaw_8000 and alaw_8000 for 8 kHz voice circuits. The default is mp3_44100_128.

    The character limit depends on the model. Flash v2.5 and Turbo v2.5 accept up to 40,000 characters per call. Multilingual v2 caps at 10,000. V3 caps at 5,000. Past these limits, split the text into multiple calls.
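
    Splitting long text is easiest at sentence boundaries, so that no call starts or ends mid-sentence. Here is a minimal sketch, assuming simple punctuation-based sentence detection is good enough for your text; `split_for_tts` is a hypothetical helper, not part of the API.

```python
import re


def split_for_tts(text, max_chars):
    """Split text into chunks of at most max_chars characters.

    Breaks after '.', '!' or '?' where possible; a single sentence longer
    than the cap is hard-split at the cap.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Oversized sentence: flush what we have, then cut it at the cap.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


# Per-call caps quoted above: 40,000 (Flash/Turbo), 10,000 (Multilingual v2), 5,000 (V3).
chapter_chunks = split_for_tts("First sentence. Second sentence.", 10_000)
```

    Reusing the same voice, settings, and seed for every chunk keeps the request parameters identical across calls.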

    Output is reproducible when you pass an integer seed in the range 0 to 4,294,967,295. The same text, voice, model, voice settings, and seed return the same audio. This is useful for regression tests and for deterministic playback in games and assistive tech.
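
    The upper bound is 2^32 - 1, so a seed fits in an unsigned 32-bit integer. A small sketch of client-side validation follows, plus one way to derive a stable seed from a label such as a character name; both helpers are hypothetical, and hashing the label is an illustration rather than anything the API requires.

```python
import hashlib

MAX_SEED = 4_294_967_295  # 2**32 - 1, the upper bound quoted above


def validate_seed(seed):
    """Return seed unchanged if it is an int in [0, MAX_SEED]; raise otherwise."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(seed, bool) or not isinstance(seed, int):
        raise TypeError("seed must be an integer")
    if not 0 <= seed <= MAX_SEED:
        raise ValueError(f"seed must be between 0 and {MAX_SEED}")
    return seed


def seed_from_label(label):
    """Derive a stable in-range seed from a string (illustrative only)."""
    digest = hashlib.sha256(label.encode("utf-8")).digest()
    # The first 4 bytes give an unsigned 32-bit value, always within range.
    return int.from_bytes(digest[:4], "big")
```

    Deriving the seed from a fixed label means the value never has to be stored separately: the same label always maps to the same seed.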

    Five voice settings are tunable. Stability trades emotional range for consistency, and high values can sound monotone; similarity boost controls how closely the output matches the source voice; style amplifies the speaker's character at the cost of latency; speed multiplies the normal pace; and use_speaker_boost adds extra similarity at a small latency cost.
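
    One way to keep these five settings manageable is to define named presets and derive variations from them. In the sketch below, only `use_speaker_boost` is a parameter name taken from this page; the other field names, the numeric values, and the 0-to-1 scale they imply are assumptions for illustration, so check the parameter docs for the real names, ranges, and where the settings sit in the request.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class VoiceSettings:
    # ASSUMPTION: field names other than use_speaker_boost are guesses at
    # how the five controls described above are spelled in the request.
    stability: float
    similarity_boost: float
    style: float
    speed: float
    use_speaker_boost: bool


# Illustrative values only; tune by ear against your own voice.
AUDIOBOOK = VoiceSettings(
    stability=0.7,
    similarity_boost=0.75,
    style=0.0,
    speed=1.0,
    use_speaker_boost=True,
)

# Same voice, more energy: derive the promo preset instead of retyping it.
PROMO = replace(AUDIOBOOK, stability=0.3, style=0.6, speed=1.1)

promo_settings = asdict(PROMO)  # plain dict, ready to merge into a request
```

    Frozen dataclasses keep a preset from being mutated by accident, which matters when the same settings profile is also part of a cache key or a seeded, reproducible render.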

    Cloned voices are supported. Voice cloning is set up inside ElevenLabs, and the resulting voice ID becomes selectable here. Pull your voice list from the resources catalog; any returned ID, including clones, drops in as the voice parameter.

    ElevenLabs sub-models

    Six routes share this page: text-to-speech, dialogue, sound effects, audio isolation, speech-to-text, and voice changer. Pick the one that matches your audio job.