ElevenLabs Speech-to-Text API

Transcribe audio to text with AI

Click or drag & drop: MP3, WAV, FLAC, OGG · Max 100MB
Speaker Diarization
Identify who is speaking
Tag Audio Events
Tag events like (laughter), (music), (footsteps)

Features

What ElevenLabs Speech-to-Text API offers

Upload an audio file through the audio_url input, up to the configured size cap
Optional language_code; leave it blank for automatic language detection
Speaker diarization with optional num_speakers or diarization_threshold when diarize is true
timestamps_granularity: none, word, or character for timed captions
tag_audio_events for markers such as (laughter) or (music) in the transcript
entity_detection modes: all, PII, PHI, PCI, offensive language, or none
Optional keyterms list of up to 100 comma-separated terms to boost names and jargon
Temperature and seed knobs when you need reproducible or varied decoding
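
The parameters listed above could be assembled into one request body. The following is a minimal Python sketch, not Unifically's documented client: the function name build_transcription_payload, the default values, and the flat dictionary shape are assumptions for illustration. Only the field names and the three granularity values come from the feature list.

```python
from typing import Any, Optional

# Granularity values named in the feature list.
GRANULARITIES = {"none", "word", "character"}


def build_transcription_payload(
    audio_url: str,
    language_code: Optional[str] = None,   # blank means automatic detection
    diarize: bool = False,
    num_speakers: Optional[int] = None,
    diarization_threshold: Optional[float] = None,
    timestamps_granularity: str = "word",  # default chosen for this sketch
    tag_audio_events: bool = False,
) -> dict[str, Any]:
    """Build a request body (hypothetical shape) from the listed options."""
    if timestamps_granularity not in GRANULARITIES:
        raise ValueError(f"unknown granularity: {timestamps_granularity!r}")
    # The feature list ties both diarization options to diarize being true.
    if not diarize and (num_speakers is not None or diarization_threshold is not None):
        raise ValueError("num_speakers and diarization_threshold require diarize=True")

    payload: dict[str, Any] = {
        "audio_url": audio_url,
        "diarize": diarize,
        "timestamps_granularity": timestamps_granularity,
        "tag_audio_events": tag_audio_events,
    }
    if language_code:
        payload["language_code"] = language_code
    if num_speakers is not None:
        payload["num_speakers"] = num_speakers
    if diarization_threshold is not None:
        payload["diarization_threshold"] = diarization_threshold
    return payload
```

Omitting language_code when it is empty mirrors the "leave blank" behavior described above. Whether the service rejects diarization options sent with diarize set to false is not stated, so the check here is a conservative choice.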

Use cases

Built for

Primary

Meeting notes from recorded calls, with speaker labels showing who spoke when diarization is on

#2

Podcast transcripts with word-level timing for scroll highlighting

#3

Compliance review when PHI entity_detection should flag sensitive spans

#4

Legal and medical workflows when you enable entity_detection carefully

#5

Caption exports for short-form video using word timestamps

#6

Research corpora built from field recordings with domain keyterms
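
For the caption use case above, word timestamps can be grouped into SubRip (SRT) cues. The sketch below assumes each word arrives as a dictionary with "text", "start", and "end" keys, with times in seconds. That shape is hypothetical and is not taken from the provider's response format, so adapt the key names to the actual output.

```python
def format_srt_time(seconds: float) -> str:
    """Render seconds as the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, total_ms = divmod(total_ms, 3_600_000)
    minutes, total_ms = divmod(total_ms, 60_000)
    secs, millis = divmod(total_ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words: list[dict], max_words: int = 6) -> str:
    """Group timed words (assumed shape) into numbered SRT cues."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start = format_srt_time(chunk[0]["start"])  # first word opens the cue
        end = format_srt_time(chunk[-1]["end"])     # last word closes it
        text = " ".join(w["text"] for w in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)
```

Splitting on a fixed word count keeps the example short. Short-form video captions often read better when cues also break on pauses or punctuation, which this sketch does not attempt.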

FAQ

About ElevenLabs Speech-to-Text API

You upload audio through Unifically. The job returns text with optional diarization, timestamps, audio event tags, entity spans, and boosted vocabulary.

When tag_audio_events is enabled, non-speech sounds can appear as inline cues such as (laughter) or (music), depending on the provider's behavior.

The UI accepts comma-separated terms. The payload sends up to 100 trimmed strings to bias recognition toward product names and technical words.
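
The keyterm handling described above can be approximated in a few lines of Python. This is a sketch under assumptions, not Unifically's code: the function name parse_keyterms is hypothetical, and both dropping empty entries and truncating (rather than rejecting) input beyond 100 terms are guesses about the behavior.

```python
def parse_keyterms(raw: str, limit: int = 100) -> list[str]:
    """Split a comma-separated string into trimmed, non-empty terms."""
    terms = [term.strip() for term in raw.split(",")]
    # Assumption: blank entries (e.g. from a doubled comma) are discarded.
    terms = [term for term in terms if term]
    # Assumption: anything past the limit is silently dropped.
    return terms[:limit]
```

For example, parse_keyterms(" Unifically , ElevenLabs,, diarization ") yields the three terms with surrounding whitespace removed and the empty entry skipped.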

Unifically lists ElevenLabs Speech-to-Text at $0.001056 per second of audio processed.
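
The per-second rate makes job costs easy to estimate. The sketch below multiplies the listed rate by the audio duration using Decimal to avoid floating-point drift. It assumes billing is strictly proportional to duration, with no minimum charge or rounding to billing increments, which the listing does not confirm. The function name estimate_cost is illustrative.

```python
from decimal import Decimal

# Listed rate in US dollars per second of audio processed.
PRICE_PER_SECOND = Decimal("0.001056")


def estimate_cost(duration_seconds: float) -> Decimal:
    """Estimate the charge for a clip, assuming purely linear billing."""
    if duration_seconds < 0:
        raise ValueError("duration must be non-negative")
    # Convert through str so float inputs do not carry binary artifacts.
    return PRICE_PER_SECOND * Decimal(str(duration_seconds))
```

At this rate one minute of audio comes to about $0.063 and one hour (3600 seconds) to about $3.80.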