ElevenLabs Speech-to-Text API

Transcribe audio to text with AI

Audio File (required)

Audio file to transcribe

Supported formats: MP3, WAV, FLAC, OGG · Max 100 MB

Language Code

ISO 639-1/639-3 language code (leave empty for auto-detect)

Speaker Diarization

Identify who is speaking

Timestamps

Timestamp granularity for transcription

Tag Audio Events

Tag events like (laughter), (music), (footsteps)

Entity Detection

Detect entities in transcription

Temperature

Randomness, from 0 to 2. Higher values produce more diverse output

Seed

Seed for deterministic results (0-2147483647)

Key Terms

Comma-separated words to boost recognition accuracy (max 100)

Features

What ElevenLabs Speech-to-Text API offers

Audio upload through the audio_url file input, up to the configured size cap

Optional language_code or leave blank for automatic language detection

Speaker diarization with optional num_speakers or diarization_threshold when diarize is true

timestamps_granularity: none, word, or character for timed captions

tag_audio_events for markers such as (laughter) or (music) in the transcript

entity_detection modes: all, PII, PHI, PCI, offensive language, or none

Optional keyterms list: up to 100 comma-separated terms that boost recognition of names and jargon

temperature and seed controls for when you need varied or reproducible decoding
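
The options listed above can be collected into a single request payload. The following is a minimal sketch in Python, not the documented request format: the field names come from this page, but the function name build_transcription_payload, the flat dictionary shape, and the default granularity are assumptions made for illustration. The range checks mirror the limits shown in the form (temperature 0 to 2, seed 0 to 2147483647).

```python
from typing import Any, Optional

# Values listed on this page for timestamps_granularity.
TIMESTAMP_GRANULARITIES = {"none", "word", "character"}
MAX_SEED = 2_147_483_647


def build_transcription_payload(
    audio_url: str,
    language_code: Optional[str] = None,
    diarize: bool = False,
    num_speakers: Optional[int] = None,
    timestamps_granularity: str = "word",  # default chosen for this sketch only
    tag_audio_events: bool = False,
    temperature: Optional[float] = None,
    seed: Optional[int] = None,
) -> dict[str, Any]:
    """Validate options and return a payload dict (hypothetical shape)."""
    if timestamps_granularity not in TIMESTAMP_GRANULARITIES:
        raise ValueError(f"unknown granularity: {timestamps_granularity!r}")
    if temperature is not None and not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")
    if seed is not None and not 0 <= seed <= MAX_SEED:
        raise ValueError(f"seed must be between 0 and {MAX_SEED}")
    if num_speakers is not None and not diarize:
        raise ValueError("num_speakers only applies when diarize is true")

    payload: dict[str, Any] = {
        "audio_url": audio_url,
        "diarize": diarize,
        "timestamps_granularity": timestamps_granularity,
        "tag_audio_events": tag_audio_events,
    }
    # Leaving language_code out requests automatic language detection.
    if language_code:
        payload["language_code"] = language_code
    if num_speakers is not None:
        payload["num_speakers"] = num_speakers
    if temperature is not None:
        payload["temperature"] = temperature
    if seed is not None:
        payload["seed"] = seed
    return payload
```

Omitting optional keys instead of sending null values keeps the payload small and leaves defaults to the service. Whether the service accepts nulls is not stated on this page.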

Use cases

Built for

Meeting notes from recorded calls, with speaker attribution when diarization is on

Podcast transcripts with word-level timing for scroll highlighting

Compliance review when PHI entity_detection should flag sensitive spans

Legal and medical workflows when you enable entity_detection carefully

Caption exports for short-form video using word timestamps

Research corpora built from field recordings with domain keyterms
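
For the caption and podcast use cases above, word-level timestamps can be grouped into SubRip (SRT) cues. The sketch below assumes each word arrives as a dictionary with text, start, and end keys, with times in seconds. That response shape is not specified on this page, and the function names are hypothetical.

```python
def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words: list[dict], max_words: int = 7) -> str:
    """Group timed words into numbered SRT cues of at most max_words each."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start = format_srt_time(chunk[0]["start"])
        end = format_srt_time(chunk[-1]["end"])
        text = " ".join(w["text"] for w in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}")
    return "\n\n".join(cues) + "\n" if cues else ""


# Assumed word records; real field names may differ.
words = [
    {"text": "Hello", "start": 0.0, "end": 0.4},
    {"text": "world", "start": 0.5, "end": 0.9},
]
print(words_to_srt(words))
```

A fixed word count per cue is the simplest grouping rule. Splitting on pauses or punctuation usually reads better but needs more information than this sketch assumes.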

FAQ

About ElevenLabs Speech-to-Text API

You upload audio through Unifically. The job returns text with optional diarization, timestamps, audio event tags, entity spans, and boosted vocabulary.

When tag_audio_events is enabled, non-speech sounds can appear as inline cues such as (laughter) or (music), depending on the provider's behavior.
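
If downstream text needs the cues separated from the spoken words, they can be pulled out with a pattern match. This is a rough sketch that assumes cues always appear as lowercase words in parentheses, as in the examples above. The function name split_audio_events is hypothetical, and a spoken parenthetical that happens to be lowercase would be matched too.

```python
import re

# Assumption: cues look like "(laughter)" or "(door closing)".
CUE_PATTERN = re.compile(r"\(([a-z][a-z ]*)\)")


def split_audio_events(transcript: str) -> tuple[str, list[str]]:
    """Return the transcript without cues, plus the cues in order of appearance."""
    events = CUE_PATTERN.findall(transcript)
    clean = CUE_PATTERN.sub("", transcript)
    # Collapse the gaps left behind by removed cues.
    clean = re.sub(r"\s{2,}", " ", clean).strip()
    return clean, events


text, events = split_audio_events("Thanks everyone (laughter) let's begin (music)")
print(text)    # Thanks everyone let's begin
print(events)  # ['laughter', 'music']
```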

The UI accepts comma-separated terms. The payload sends up to one hundred trimmed strings to bias recognition toward product names and technical words.
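
The conversion from the comma-separated field to the list of trimmed strings might look like the sketch below. The function name parse_keyterms is hypothetical. Dropping empty entries and truncating anything past the hundredth term are assumptions, since the page does not say whether excess terms are cut off or rejected.

```python
MAX_KEYTERMS = 100


def parse_keyterms(raw: str, limit: int = MAX_KEYTERMS) -> list[str]:
    """Split on commas, trim whitespace, drop empties, keep at most limit terms."""
    terms = [term.strip() for term in raw.split(",")]
    terms = [term for term in terms if term]
    return terms[:limit]


print(parse_keyterms(" Unifically, ElevenLabs ,, diarization "))
# ['Unifically', 'ElevenLabs', 'diarization']
```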

Unifically lists ElevenLabs Speech to Text at $0.001056 per second of audio processed.
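
At that rate, one minute of audio comes to about $0.063 and one hour to about $3.80. The sketch below works through the arithmetic. It assumes billing is strictly proportional to duration; any rounding or minimum charge is not stated on this page, and the function name estimate_cost is hypothetical.

```python
from decimal import Decimal

# Listed price: USD per second of audio processed.
PRICE_PER_SECOND = Decimal("0.001056")


def estimate_cost(duration_seconds: float) -> Decimal:
    """Estimate cost in USD, assuming purely per-second billing."""
    return PRICE_PER_SECOND * Decimal(str(duration_seconds))


print(estimate_cost(60))    # 0.063360
print(estimate_cost(3600))  # 3.801600
```

Decimal avoids the binary floating-point drift that can creep in when many small per-second charges are summed.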