Video summarizer with Vercel AI SDK and Ittybit

Use Ittybit to extract the audio track from a video, then pipe it through the Vercel AI SDK to transcribe and summarize the content in a streaming response. The user sees the summary appear token-by-token while the model is still generating.

Install dependencies

npm install ai @ai-sdk/openai @ai-sdk/react

Extract audio with Ittybit

Create a Server Action that POSTs an audio task to Ittybit, then polls until the extracted audio file is ready.

// app/actions.ts
'use server';

// Exported so the route handler can import it.
export async function extractAudio(videoUrl: string): Promise<string> {
  // Create the audio-extraction task.
  const res = await fetch('https://api.ittybit.com/jobs', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.ITTYBIT_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      input: videoUrl,
      kind: 'audio',
      options: { format: 'mp3' },
    }),
  });
  if (!res.ok) {
    throw new Error(`Ittybit request failed with status ${res.status}`);
  }

  const task = await res.json();
  return pollTask(task.id);
}

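async function pollTask(taskId: string): Promise<string> {
  while (true) {
    const res = await fetch(`https://api.ittybit.com/jobs/${taskId}`, {
      headers: {
        Authorization: `Bearer ${process.env.ITTYBIT_API_KEY}`,
      },
    });
    // Surface HTTP errors instead of parsing an error body as a task.
    if (!res.ok) {
      throw new Error(`Polling task ${taskId} failed with status ${res.status}`);
    }
    const task = await res.json();

    if (task.status === 'failed') {
      throw new Error(`Task ${taskId} failed`);
    }
    if (task.status === 'succeeded') {
      return task.output.url;
    }

    // Still processing: wait 2 seconds before checking again.
    await new Promise((r) => setTimeout(r, 2000));
  }
}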
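
The loop above polls until the task succeeds or fails, with no upper bound on how long it waits. Inside a route handler that may run into the platform's request timeout. A minimal sketch of a bounded variant with exponential backoff follows; pollUntil, PollResult, and the default timings are hypothetical names and values chosen for illustration, not part of Ittybit or the AI SDK.

```typescript
// Hypothetical helper: repeatedly calls `check` until it reports completion
// or the overall timeout would be exceeded.
type PollResult<T> = { done: true; value: T } | { done: false };

async function pollUntil<T>(
  check: () => Promise<PollResult<T>>,
  { timeoutMs = 60_000, initialDelayMs = 1_000, maxDelayMs = 8_000 } = {},
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  let delay = initialDelayMs;

  while (true) {
    const result = await check();
    if (result.done) return result.value;

    // Give up if the next wait would take us past the deadline.
    if (Date.now() + delay > deadline) {
      throw new Error(`Polling timed out after ${timeoutMs} ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, delay));
    delay = Math.min(delay * 2, maxDelayMs); // exponential backoff, capped
  }
}
```

In pollTask, the check callback would fetch the task, throw when the status is 'failed', return { done: true, value: task.output.url } when it is 'succeeded', and return { done: false } otherwise.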

Stream the summary with the AI SDK

Create an API route that takes a video URL, extracts the audio, and uses streamText to generate a summary. The AI SDK handles chunked streaming back to the client.

// app/api/summarize/route.ts
import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { extractAudio } from '../../actions';

export async function POST(req: Request) {
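  // useChat posts { messages: [...] }; the video URL is the content of the last user message.
  const { messages } = await req.json();
  const videoUrl: string = messages?.at(-1)?.content ?? '';
  if (!videoUrl) {
    return new Response('Missing video URL', { status: 400 });
  }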

  // 1. Extract audio via Ittybit
  const audioUrl = await extractAudio(videoUrl);

  // 2. Stream a summary using the AI SDK
  const result = streamText({
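    // Audio file parts need an audio-capable model; the standard gpt-4o chat model does not accept audio input.
    model: openai('gpt-4o-audio-preview'),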
    messages: [
      {
        role: 'system',
        content:
          'You are a video summarizer. The user will provide a URL to an audio track ' +
          'extracted from a video. Listen to the audio and return a structured summary ' +
          'with a title, a one-sentence TL;DW, and timestamped chapters.',
      },
      {
        role: 'user',
        content: [
          {
            type: 'text',
            text: 'Summarize the content from this audio track:',
          },
          {
            type: 'file',
            data: new URL(audioUrl),
            mimeType: 'audio/mpeg',
          },
        ],
      },
    ],
  });

  return result.toDataStreamResponse();
}

The extractAudio function from the previous step must be exported from app/actions.ts and imported into this file, or inlined here. It runs server-side either way.
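
The route forwards whatever string the client sends to Ittybit. One way to reject bad input before starting a task is sketched below, assuming only http and https URLs should be accepted; parseVideoUrl is a hypothetical helper, not part of either SDK.

```typescript
// Hypothetical helper: returns a parsed URL for http(s) input, or null otherwise.
function parseVideoUrl(input: unknown): URL | null {
  if (typeof input !== 'string') return null;

  let url: URL;
  try {
    url = new URL(input); // throws on strings that are not absolute URLs
  } catch {
    return null;
  }

  // Reject schemes such as file:, ftp:, or data:.
  return url.protocol === 'http:' || url.protocol === 'https:' ? url : null;
}
```

In the route handler, a null result could be turned into a 400 response, for example return new Response('Invalid video URL', { status: 400 }), before extractAudio is called.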

Client component with useChat

On the client, the useChat hook from the AI SDK manages the streaming connection and renders tokens as they arrive.

// app/summarize/page.tsx
'use client';

import { useChat } from '@ai-sdk/react';
import { useState } from 'react';

export default function SummarizePage() {
  const [videoUrl, setVideoUrl] = useState('');

  const { messages, append, isLoading } = useChat({
    api: '/api/summarize',
  });

  function handleSubmit(e: React.FormEvent) {
    e.preventDefault();
    append({
      role: 'user',
      content: videoUrl,
    });
  }

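  // Use the most recent assistant message so a new submission replaces the previous summary.
  const summary = [...messages].reverse().find((m) => m.role === 'assistant');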

  return (
    <div>
      <form onSubmit={handleSubmit}>
        <input
          type="url"
          value={videoUrl}
          onChange={(e) => setVideoUrl(e.target.value)}
          placeholder="https://example.com/video.mp4"
          required
        />
        <button type="submit" disabled={isLoading}>
          {isLoading ? 'Summarizing...' : 'Summarize'}
        </button>
      </form>

      {summary && (
        <div>
          <h2>Summary</h2>
          <pre style={{ whiteSpace: 'pre-wrap' }}>{summary.content}</pre>
        </div>
      )}
    </div>
  );
}

How it fits together

  1. The user pastes a video URL and submits the form.
  2. useChat POSTs the URL to /api/summarize as the content of a user message.
  3. The route handler calls Ittybit to extract the audio track and polls until the file is ready.
  4. The audio URL is passed to streamText, which sends it to the model as a file attachment.
  5. The model processes the audio and streams back a summary token-by-token.
  6. useChat updates the UI in real time as chunks arrive.

The Ittybit extraction step runs on every request, even for a video that has already been processed. To avoid re-extracting audio for the same video, cache the audio URL keyed by the input video URL.
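
A minimal sketch of such a cache, assuming an in-memory Map is acceptable (entries are lost on restart and are not shared between server instances) and that the audio URL stays valid for as long as it is cached. The names cachedExtract and audioUrlCache are hypothetical; the extract parameter stands in for extractAudio.

```typescript
// Hypothetical in-memory cache: input video URL -> promise of the audio URL.
const audioUrlCache = new Map<string, Promise<string>>();

function cachedExtract(
  videoUrl: string,
  extract: (videoUrl: string) => Promise<string>,
): Promise<string> {
  const cached = audioUrlCache.get(videoUrl);
  if (cached) return cached;

  // Cache the promise itself so concurrent requests for the same video
  // share a single extraction instead of starting one each.
  const pending = extract(videoUrl).catch((err) => {
    audioUrlCache.delete(videoUrl); // do not cache failures
    throw err;
  });
  audioUrlCache.set(videoUrl, pending);
  return pending;
}
```

In the route handler this would replace the direct call: const audioUrl = await cachedExtract(videoUrl, extractAudio). A shared store such as a database or key-value service would be needed for the cache to survive restarts or span multiple instances.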

See also