Intelligence
Speech
Overview
The speech task transcribes spoken audio from a video or audio file into structured text with speaker identification, language detection, and time-aligned segments.
Creating a speech task
The speech task has no configurable options — all analysis is returned by default.
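As a minimal sketch, the request body for a speech task could be built like this. The field names (`url`, `kind`) and the task identifier `"speech"` are assumptions for illustration, not confirmed API details; note that no options object is needed since the task is not configurable.

```python
import json


def build_speech_task_payload(file_url: str) -> str:
    """Build the JSON body for a speech task request.

    The speech task has no configurable options, so the payload
    only names the (assumed) task kind and the input file URL.
    """
    payload = {
        "url": file_url,   # assumed input field name
        "kind": "speech",  # assumed task identifier
    }
    return json.dumps(payload)


body = build_speech_task_payload("https://example.com/podcast.mp3")
print(body)
```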
Output
When the task succeeds, the output includes:
| Field | Type | Description |
|---|---|---|
| text | string[] | Transcript text segments |
| languages | string[] | Detected languages (ISO 639-1 codes) |
| speakers | integer | Number of distinct speakers detected |
| timeline | array | Time-coded transcript segments, each with start, end, text, and speaker |
The output JSON file contains the full transcript data.
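As an illustration of the shape described in the table above, here is a hypothetical output for a short two-speaker clip. The field names follow the table; the values and speaker labels are invented for the example.

```python
import json

# Hypothetical speech task output; values are invented for illustration.
example_output = json.dumps({
    "text": ["Welcome to the show.", "Thanks for having me."],
    "languages": ["en"],
    "speakers": 2,
    "timeline": [
        {"start": 0.0, "end": 1.8,
         "text": "Welcome to the show.", "speaker": "speaker_1"},
        {"start": 1.9, "end": 3.4,
         "text": "Thanks for having me.", "speaker": "speaker_2"},
    ],
})

data = json.loads(example_output)
print(data["speakers"], data["languages"])
```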
Supported inputs
Speech tasks work with:
- Audio files (.mp3, .m4a, .wav, .ogg, .aac, .flac)
- Video files with embedded audio (.mp4, .mov, .webm)
Common use cases
- Video and podcast transcription
- Generating subtitles or captions
- Searchable transcripts and AI summaries
- Creating text-based chapter markers
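For the subtitle use case above, the time-coded timeline segments map directly onto the SRT format. A sketch of that conversion, using the segment fields from the output table (the sample data is invented for illustration):

```python
def _srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def timeline_to_srt(timeline: list[dict]) -> str:
    """Render time-coded transcript segments as an SRT file body."""
    blocks = []
    for i, seg in enumerate(timeline, start=1):
        blocks.append(
            f"{i}\n"
            f"{_srt_timestamp(seg['start'])} --> {_srt_timestamp(seg['end'])}\n"
            f"{seg['text']}\n"
        )
    return "\n".join(blocks)


segments = [
    {"start": 0.0, "end": 1.8,
     "text": "Welcome to the show.", "speaker": "speaker_1"},
    {"start": 1.9, "end": 3.4,
     "text": "Thanks for having me.", "speaker": "speaker_2"},
]
srt = timeline_to_srt(segments)
print(srt)
```

The speaker field is ignored here; a caption workflow that needs speaker labels could prepend `seg['speaker']` to each text line.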