
Transcription

Transcribe any video or audio file to text, with timestamps, speaker detection, and support for 99+ languages.

Model: ElevenLabs Scribe
Price: from 0.8 + 0.08/min

0 – 20
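
To get a rough feel for what a job might cost and how long it might take, the listed pricing and the turnaround figure from the FAQ (10-30% of the audio duration) can be combined in a small calculator. This is only a sketch: it assumes the price is a flat base of 0.8 plus 0.08 for each minute of audio, in whatever unit the page uses, and that minutes are billed fractionally. The actual billing rules may differ. The names `estimate_job` and `Estimate` are made up for illustration.

```python
from dataclasses import dataclass

# Assumed reading of the listed price: base charge plus a per-minute rate.
# The page does not state the unit or how partial minutes are billed.
BASE_PRICE = 0.8
PER_MINUTE_PRICE = 0.08

# Turnaround range quoted in the FAQ: 10-30% of the audio duration.
MIN_TIME_FRACTION = 0.10
MAX_TIME_FRACTION = 0.30


@dataclass
class Estimate:
    price: float             # in the page's (unstated) pricing unit
    min_wait_minutes: float  # optimistic processing time
    max_wait_minutes: float  # pessimistic processing time


def estimate_job(audio_minutes: float) -> Estimate:
    """Estimate price and processing time for a recording of the given length."""
    if audio_minutes < 0:
        raise ValueError("audio_minutes must be non-negative")
    price = BASE_PRICE + PER_MINUTE_PRICE * audio_minutes
    return Estimate(
        price=round(price, 2),
        min_wait_minutes=round(audio_minutes * MIN_TIME_FRACTION, 2),
        max_wait_minutes=round(audio_minutes * MAX_TIME_FRACTION, 2),
    )


if __name__ == "__main__":
    print(estimate_job(10))
```

Under these assumptions, a 10-minute recording comes to a price of 1.6 and a wait of 1 to 3 minutes, which matches the FAQ's own example for turnaround.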

Recipes using this pipeline

AI Video Transcription

Upload any video or audio file and get back an accurate transcript with word-level timestamps. Speaker detection is optional. Output is available as SRT subtitles or plain text, ready for captioning, editing, or repurposing.

What you can do with it

  • Caption videos — generate the SRT, then burn or overlay it for social
  • Repurpose long-form content — feed transcripts into any LLM for summaries, clip ideas, and social copy
  • Transcribe interviews + podcasts — multi-speaker recordings with diarization
  • Accessibility — make spoken content readable and searchable
  • Translation prep — clean transcript as the source for translated voiceovers

How it works

  1. Upload — video or audio, any common format (MP4, MOV, MP3, WAV, M4A, FLAC)
  2. Pick a language — or leave on auto-detect
  3. Toggle speaker detection — labels each segment with the speaker
  4. Download — SRT with timestamps + plain text transcript

Output formats

  • SRT — subtitle file with millisecond-accurate timestamps, drops straight into video editors
  • Plain text — clean readable transcript for editing, summarization, or translation
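
For readers who have not seen an SRT file, the sketch below builds one from a list of timed segments. The SRT layout itself (a running index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time line, then the text, with a blank line between cues) is the standard format. The `Segment` type and the `Speaker A: ` prefix are assumptions for illustration, since the page does not say how the pipeline writes speaker labels into its SRT output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    start: float                   # start time in seconds
    end: float                     # end time in seconds
    text: str
    speaker: Optional[str] = None  # hypothetical label, e.g. "Speaker A"


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Render segments as SRT cues, numbered from 1 and separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        line = f"{seg.speaker}: {seg.text}" if seg.speaker else seg.text
        time_line = f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}"
        cues.append(f"{index}\n{time_line}\n{line}")
    return "\n\n".join(cues) + "\n"


if __name__ == "__main__":
    print(to_srt([
        Segment(0.0, 2.5, "Welcome to the show.", "Speaker A"),
        Segment(2.5, 5.0, "Thanks for having me.", "Speaker B"),
    ]))
```

Dropping the index and time lines and keeping only the text gives the plain-text variant.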

Frequently Asked Questions

What file formats are supported?
Any video or audio file — MP4, MOV, AVI, MP3, WAV, M4A, FLAC, and more. The pipeline automatically extracts the audio track from video files.
How accurate is the transcription?
Word-error rates under 5% for clear audio in major languages. Accuracy depends on audio quality, background noise, and speaker clarity.
What is speaker detection?
When enabled, the transcript labels each segment with the speaker who said it (Speaker A, Speaker B, etc.). Use the Number of Speakers hint to improve accuracy when you know how many people are in the recording.
Which languages are supported?
99 languages with automatic detection. Set a specific language for best accuracy when you know the source language.
How long does it take?
Typically 10-30% of the audio duration. A 10-minute recording takes 1-3 minutes to transcribe.
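
The accuracy answer above is stated as a word-error rate (WER). As a point of reference, WER is conventionally computed as the word-level edit distance between a trusted reference transcript and the machine transcript, divided by the number of reference words. The sketch below shows that standard calculation so you can check a result against your own reference. It is not the pipeline's internal scoring, and the simple lowercase-and-split normalization is an assumption (punctuation is not stripped).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Row-by-row Levenshtein distance over words.
    # prev[j] is the distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the quick brown fox", "the quick brown box"))
```

A WER under 5% means fewer than one word in twenty is wrong, missing, or extra relative to the reference.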
