Speech

Overview

The speech task transcribes spoken audio from a video or audio file into structured text with speaker identification, language detection, and time-aligned segments.


Creating a speech task

curl -X POST "https://api.ittybit.com/tasks" \
-H "Authorization: Bearer ITTYBIT_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "url": "https://example.com/video.mp4",
  "kind": "speech"
}'

The speech task has no configurable options; all analysis is returned by default.
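
For callers who prefer Python to curl, the same request can be sketched with the standard library. The endpoint, headers, and body fields are copied from the curl example; the function names `build_speech_task_request` and `create_speech_task` are illustrative, and the assumption that the API replies with a JSON body is not confirmed by this page.

```python
import json
import urllib.request

API_URL = "https://api.ittybit.com/tasks"  # endpoint taken from the curl example


def build_speech_task_request(media_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a speech task."""
    body = json.dumps({"url": media_url, "kind": "speech"}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def create_speech_task(media_url: str, api_key: str) -> dict:
    """Send the request and decode the response.

    Assumption: the response body is JSON. Error handling is omitted.
    """
    request = build_speech_task_request(media_url, api_key)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes it possible to inspect or log the request without touching the network.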


Output

When the task succeeds, the output includes:

| Field | Type | Description |
| --- | --- | --- |
| text | string[] | Transcript text segments |
| languages | string[] | Detected languages (ISO 639-1) |
| speakers | integer | Number of distinct speakers detected |
| timeline | array | Time-coded transcript segments with start, end, text, and speaker |

{
  "id": "task_abcdefgh12345678",
  "object": "task",
  "kind": "speech",
  "status": "succeeded",
  "inputs": [...],
  "outputs": [...],
  "created_at": 1735689825,
  "updated_at": 1735689886
}
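
A small sketch of how a client might read the task object above. It checks the `status` field and, assuming `created_at` and `updated_at` are Unix timestamps in seconds (which the sample values suggest but this page does not state), reports the time between them. The helper name `processing_seconds` is hypothetical.

```python
from typing import Optional


def processing_seconds(task: dict) -> Optional[int]:
    """Return updated_at - created_at for a succeeded task, otherwise None.

    Assumption: both fields are Unix timestamps in seconds.
    """
    if task.get("status") != "succeeded":
        return None
    return task["updated_at"] - task["created_at"]


# Fields copied from the sample task object (inputs and outputs left out).
sample_task = {
    "id": "task_abcdefgh12345678",
    "object": "task",
    "kind": "speech",
    "status": "succeeded",
    "created_at": 1735689825,
    "updated_at": 1735689886,
}

print(processing_seconds(sample_task))  # 61
```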

The output JSON file contains the full transcript data:

{
  "kind": "speech",
  "text": [
    "Hello, and welcome to UkeTube. I'm Jesse Doe.",
    "And I'm John Doe. Today we're going to be learning Sandstorm by Darude."
  ],
  "languages": ["en"],
  "speakers": 2,
  "timeline": [
    {
      "start": 12.00,
      "end": 14.50,
      "speaker": 0,
      "text": "Hello, and welcome to UkeTube. I'm Jesse Doe."
    },
    {
      "start": 14.80,
      "end": 18.28,
      "speaker": 1,
      "text": "And I'm John Doe. Today we're going to be learning Sandstorm by Darude."
    }
  ]
}
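
As one way to use the `timeline` array, the sketch below turns its segments into SubRip (SRT) caption text. It assumes `start` and `end` are expressed in seconds, which matches the sample values but is not stated explicitly here. The function names are illustrative and not part of the API.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def timeline_to_srt(timeline: list) -> str:
    """Render timeline segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, segment in enumerate(timeline, start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        cues.append(f"{index}\n{start} --> {end}\n{segment['text']}\n")
    return "\n".join(cues)


# Timeline copied from the sample output above.
timeline = [
    {"start": 12.00, "end": 14.50, "speaker": 0,
     "text": "Hello, and welcome to UkeTube. I'm Jesse Doe."},
    {"start": 14.80, "end": 18.28, "speaker": 1,
     "text": "And I'm John Doe. Today we're going to be learning Sandstorm by Darude."},
]

print(timeline_to_srt(timeline))
```

Working in integer milliseconds avoids floating-point drift when splitting a time into hours, minutes, and seconds.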

Supported inputs

Speech tasks work with:

  • Audio files (.mp3, .m4a, .wav, .ogg, .aac, .flac)
  • Video files with embedded audio (.mp4, .mov, .webm)
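
A client could pre-check a URL against the extensions listed above before creating a task. This is only a convenience sketch: the helper name `has_supported_extension` is hypothetical, the list may not be exhaustive, and a matching extension does not guarantee the file contains decodable audio.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions copied from the list above.
AUDIO_EXTENSIONS = {".mp3", ".m4a", ".wav", ".ogg", ".aac", ".flac"}
VIDEO_EXTENSIONS = {".mp4", ".mov", ".webm"}


def has_supported_extension(url: str) -> bool:
    """Return True if the URL path ends in one of the listed extensions."""
    path = urlparse(url).path  # drops any query string or fragment
    suffix = PurePosixPath(path).suffix.lower()
    return suffix in AUDIO_EXTENSIONS | VIDEO_EXTENSIONS
```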

Common use cases

  • Video and podcast transcription
  • Generating subtitles or captions
  • Searchable transcripts and AI summaries
  • Creating text-based chapter markers
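
To illustrate the searchable-transcript use case, the sketch below does a case-insensitive keyword search over `timeline` segments and returns where each match starts and who spoke it. The function name `find_mentions` is illustrative, and the segment shape is taken from the sample output on this page.

```python
def find_mentions(timeline: list, query: str) -> list:
    """Return (start, speaker, text) for every segment whose text contains query."""
    needle = query.casefold()
    return [
        (segment["start"], segment["speaker"], segment["text"])
        for segment in timeline
        if needle in segment["text"].casefold()
    ]


# Segments copied from the sample output.
segments = [
    {"start": 12.00, "end": 14.50, "speaker": 0,
     "text": "Hello, and welcome to UkeTube. I'm Jesse Doe."},
    {"start": 14.80, "end": 18.28, "speaker": 1,
     "text": "And I'm John Doe. Today we're going to be learning Sandstorm by Darude."},
]

for start, speaker, text in find_mentions(segments, "sandstorm"):
    print(f"{start:.2f}s speaker {speaker}: {text}")
```

The returned start times could also serve as jump points in a player or as candidate chapter markers.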
