Payload and output preview

Payload example

{
  "version": "v1",
  "output": {
    "width": 1080,
    "height": 1920,
    "fps": 30
  },
  "assets": [
    { "id": "talking-head", "type": "video", "url": "https://pub-2ad5592bc4ca44abb609acfc0b7c5ceb.r2.dev/reel-forge-website-assets/talking%20head%20runner.mp4" }
  ],
  "composition": {
    "timeline": [
      {
        "id": "video-layer",
        "type": "video",
        "asset_id": "talking-head",
        "time": { "start_seconds": 0, "duration_seconds": 13.8 }
      }
    ],
    "captions": {
      "provider": "assemblyai",
      "preset": "karaoke_yellow",
      "mode": "phrase_karaoke",
      "max_chars_per_segment": 20,
      "correct_text": "Not going to pretend I want to do this. I don't. But I also know I'll feel better after and hate myself if I don't. So - off we go.",
      "layout": { "x": "8%", "y": "77%", "width": "84%", "height": "18%" },
      "style": { "font_size": 64, "highlight_color": "#FFEA00" },
      "words": [
        { "text": "Not", "start": 880, "end": 1000, "speaker": "A" },
        { "text": "going", "start": 1000, "end": 1160, "speaker": "A" },
        { "text": "to", "start": 1160, "end": 1320, "speaker": "A" },
        { "text": "pretend", "start": 1320, "end": 1640, "speaker": "A" },
        { "text": "I", "start": 1640, "end": 1760, "speaker": "A" },
        { "text": "want", "start": 1760, "end": 1880, "speaker": "A" },
        { "text": "to", "start": 1880, "end": 2040, "speaker": "A" },
        { "text": "do", "start": 2040, "end": 2200, "speaker": "A" },
        { "text": "this.", "start": 2200, "end": 2480, "speaker": "A" },
        { "text": "I", "start": 3120, "end": 3440, "speaker": "A" },
        { "text": "don't.", "start": 3440, "end": 3840, "speaker": "A" },
        { "text": "But", "start": 5040, "end": 5320, "speaker": "A" },
        { "text": "I", "start": 5320, "end": 5480, "speaker": "A" },
        { "text": "also", "start": 5480, "end": 5720, "speaker": "A" },
        { "text": "know", "start": 5720, "end": 5960, "speaker": "A" },
        { "text": "I'll", "start": 5960, "end": 6160, "speaker": "A" },
        { "text": "feel", "start": 6160, "end": 6320, "speaker": "A" },
        { "text": "better", "start": 6320, "end": 6560, "speaker": "A" },
        { "text": "after", "start": 6560, "end": 6840, "speaker": "A" },
        { "text": "and", "start": 6840, "end": 7160, "speaker": "A" },
        { "text": "hate", "start": 7160, "end": 7480, "speaker": "A" },
        { "text": "myself", "start": 7480, "end": 7800, "speaker": "A" },
        { "text": "if", "start": 7800, "end": 8000, "speaker": "A" },
        { "text": "I", "start": 8000, "end": 8240, "speaker": "A" },
        { "text": "don't.", "start": 8240, "end": 8640, "speaker": "A" },
        { "text": "So—", "start": 10080, "end": 10480, "speaker": "A" },
        { "text": "off", "start": 12720, "end": 13080, "speaker": "A" },
        { "text": "we", "start": 13080, "end": 13360, "speaker": "A" },
        { "text": "go.", "start": 13360, "end": 13680, "speaker": "A" }
      ]
    }
  }
}
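
Before submitting a payload like the one above, a quick local check can catch mismatched references. The sketch below is only an illustration: `check_payload` is a hypothetical helper, not part of any SDK, and the rules it enforces (every layer's asset_id must be declared in assets, and caption words must end within the timeline) are assumptions about what a sensible request looks like, not documented server-side validation. The sample payload is trimmed to the fields the check reads.

```python
def check_payload(payload: dict) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    asset_ids = {asset["id"] for asset in payload.get("assets", [])}
    composition = payload.get("composition", {})

    # Every timeline layer should point at an asset declared in "assets".
    timeline_end = 0.0
    for layer in composition.get("timeline", []):
        if layer.get("asset_id") not in asset_ids:
            problems.append(
                f"layer {layer.get('id')} references unknown asset {layer.get('asset_id')}"
            )
        time = layer.get("time", {})
        layer_end = time.get("start_seconds", 0) + time.get("duration_seconds", 0)
        timeline_end = max(timeline_end, layer_end)

    # Word timings are in milliseconds; timeline times are in seconds.
    words = composition.get("captions", {}).get("words", [])
    if words and max(word["end"] for word in words) / 1000 > timeline_end:
        problems.append("caption words extend past the end of the timeline")
    return problems


# Trimmed copy of the payload above (hypothetical subset for illustration).
payload = {
    "assets": [{"id": "talking-head", "type": "video"}],
    "composition": {
        "timeline": [
            {
                "id": "video-layer",
                "type": "video",
                "asset_id": "talking-head",
                "time": {"start_seconds": 0, "duration_seconds": 13.8},
            }
        ],
        "captions": {"words": [{"text": "go.", "start": 13360, "end": 13680}]},
    },
}

print(check_payload(payload))  # prints []
```

The last word ends at 13680 ms (13.68 s), which fits inside the 13.8 s video layer, so no problems are reported.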

Output preview

Featured caption example using karaoke_yellow + phrase_karaoke on a talking-head clip.

Caption Examples

Use composition.captions to render captions programmatically. The preset field selects the visual style, and the mode field selects how words are grouped and highlighted.

1) One clear request example

This featured request shows the complete shape for a high-readability talking-head caption render.

Source asset

The talking-head video declared under assets with id talking-head.

Words sample

The words array is the timing source: each entry carries start and end in milliseconds.
Keep the array in the provider's native shape whenever possible rather than reformatting it.

[
  { "text": "Not", "start": 880, "end": 1000, "speaker": "A" },
  { "text": "going", "start": 1000, "end": 1160, "speaker": "A" },
  { "text": "to", "start": 1160, "end": 1320, "speaker": "A" },
  { "text": "pretend", "start": 1320, "end": 1640, "speaker": "A" },
  { "text": "I", "start": 1640, "end": 1760, "speaker": "A" },
  { "text": "want", "start": 1760, "end": 1880, "speaker": "A" },
  { "text": "to", "start": 1880, "end": 2040, "speaker": "A" },
  { "text": "do", "start": 2040, "end": 2200, "speaker": "A" }
]
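
Because word timings are in milliseconds while the timeline uses seconds, it can help to sanity-check the array before sending it. The following is a minimal sketch, assuming that each word should start before it ends and that words should not overlap their predecessor; `check_word_timings` and `to_seconds` are hypothetical helper names, and the renderer itself may tolerate inputs that this check flags.

```python
def to_seconds(milliseconds: int) -> float:
    """Convert a word timestamp from milliseconds to seconds."""
    return milliseconds / 1000.0


def check_word_timings(words: list[dict]) -> list[tuple[int, str]]:
    """Return (index, message) pairs for words whose timing looks inconsistent."""
    issues = []
    previous_end = 0
    for index, word in enumerate(words):
        if word["start"] >= word["end"]:
            issues.append((index, "start is not before end"))
        if word["start"] < previous_end:
            issues.append((index, "overlaps the previous word"))
        previous_end = word["end"]
    return issues


# The eight-word sample shown above.
sample = [
    {"text": "Not", "start": 880, "end": 1000, "speaker": "A"},
    {"text": "going", "start": 1000, "end": 1160, "speaker": "A"},
    {"text": "to", "start": 1160, "end": 1320, "speaker": "A"},
    {"text": "pretend", "start": 1320, "end": 1640, "speaker": "A"},
    {"text": "I", "start": 1640, "end": 1760, "speaker": "A"},
    {"text": "want", "start": 1760, "end": 1880, "speaker": "A"},
    {"text": "to", "start": 1880, "end": 2040, "speaker": "A"},
    {"text": "do", "start": 2040, "end": 2200, "speaker": "A"},
]

print(check_word_timings(sample))      # prints []
print(to_seconds(sample[0]["start"]))  # prints 0.88
```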

Request payload (words array truncated to three entries)

{
  "version": "v1",
  "output": { "width": 1080, "height": 1920, "fps": 30 },
  "assets": [
    { "id": "talking-head", "type": "video", "url": "https://pub-2ad5592bc4ca44abb609acfc0b7c5ceb.r2.dev/reel-forge-website-assets/talking%20head%20runner.mp4" }
  ],
  "composition": {
    "timeline": [
      {
        "id": "video-layer",
        "type": "video",
        "asset_id": "talking-head",
        "time": { "start_seconds": 0, "duration_seconds": 13.8 }
      }
    ],
    "captions": {
      "provider": "assemblyai",
      "preset": "karaoke_yellow",
      "mode": "phrase_karaoke",
      "max_chars_per_segment": 20,
      "correct_text": "Not going to pretend I want to do this. I don't. But I also know I'll feel better after and hate myself if I don't. So - off we go.",
      "layout": { "x": "8%", "y": "77%", "width": "84%", "height": "18%" },
      "style": { "font_size": 64, "highlight_color": "#FFEA00" },
      "words": [
        { "text": "Not", "start": 880, "end": 1000 },
        { "text": "going", "start": 1000, "end": 1160 },
        { "text": "to", "start": 1160, "end": 1320 }
      ]
    }
  }
}
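
The layout block in the request above is expressed in percentages. To see roughly where the caption box lands on a 1080 x 1920 frame, the sketch below converts it to pixels. It assumes that x and width are relative to the output width, that y and height are relative to the output height, and that values round to the nearest pixel; none of these are confirmed here, and `percent_to_pixels` and `layout_to_pixels` are hypothetical helper names.

```python
from fractions import Fraction


def percent_to_pixels(value: str, total: int) -> int:
    """Convert a percentage string such as "8%" into a whole pixel count."""
    fraction = Fraction(value.rstrip("%")) / 100
    return round(fraction * total)


def layout_to_pixels(layout: dict, output: dict) -> dict:
    """Resolve a percentage layout against the output size (assumed axes)."""
    return {
        "x": percent_to_pixels(layout["x"], output["width"]),
        "y": percent_to_pixels(layout["y"], output["height"]),
        "width": percent_to_pixels(layout["width"], output["width"]),
        "height": percent_to_pixels(layout["height"], output["height"]),
    }


layout = {"x": "8%", "y": "77%", "width": "84%", "height": "18%"}
output = {"width": 1080, "height": 1920}

box = layout_to_pixels(layout, output)
print(box)  # prints {'x': 86, 'y': 1478, 'width': 907, 'height': 346}
```

Under these assumptions the box spans from y = 1478 to y = 1824, leaving 96 px of clearance above the bottom edge of the frame.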

Why this request works

  • Uses a single base video layer (no split-screen, no extra visual clutter)
  • Applies correct_text to improve punctuation/casing alignment
  • Uses phrase_karaoke for natural phrase grouping with active-word progression
  • Positions captions at a safe lower-third region for social readability
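
To get a feel for how max_chars_per_segment interacts with phrase grouping, the sketch below groups words greedily so that each segment's joined text stays within the limit. This is only one plausible reading of the setting: the renderer's actual grouping for phrase_karaoke is not specified here and may also consider punctuation or pauses. `group_words` is a hypothetical helper.

```python
def group_words(words: list[dict], max_chars: int) -> list[list[dict]]:
    """Greedily pack words into segments whose joined text fits in max_chars."""
    segments = []
    current = []
    for word in words:
        candidate = " ".join(w["text"] for w in current + [word])
        if current and len(candidate) > max_chars:
            # Adding this word would overflow the segment, so start a new one.
            segments.append(current)
            current = [word]
        else:
            current.append(word)
    if current:
        segments.append(current)
    return segments


words = [
    {"text": "Not", "start": 880, "end": 1000},
    {"text": "going", "start": 1000, "end": 1160},
    {"text": "to", "start": 1160, "end": 1320},
    {"text": "pretend", "start": 1320, "end": 1640},
    {"text": "I", "start": 1640, "end": 1760},
    {"text": "want", "start": 1760, "end": 1880},
    {"text": "to", "start": 1880, "end": 2040},
    {"text": "do", "start": 2040, "end": 2200},
]

for segment in group_words(words, max_chars=20):
    text = " ".join(w["text"] for w in segment)
    print(text, segment[0]["start"], segment[-1]["end"])
# prints:
# Not going to pretend 880 1640
# I want to do 1640 2200
```

With a limit of 20, the first four words fit exactly ("Not going to pretend" is 20 characters), and the next word starts a new segment.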

2) Caption presets comparison

Use the same source clip and timing input, then switch captions.preset to compare output style.
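
One way to set up that comparison is to copy a base payload once per preset and change only captions.preset. The sketch below uses the preset identifiers named in this section; `preset_variants` is a hypothetical helper, and the base payload is trimmed for brevity. Whether style overrides in the base payload should be kept when switching presets is not covered here.

```python
import copy

# Preset identifiers named in this section.
PRESETS = [
    "tiktok_classic",
    "bold_outline",
    "karaoke_yellow",
    "neon_glow",
    "soft_pill",
    "typewriter",
    "handwriting",
    "luxury_serif",
]


def preset_variants(base_payload: dict, presets: list[str]) -> dict[str, dict]:
    """Return one independent payload copy per preset, keyed by preset name."""
    variants = {}
    for preset in presets:
        variant = copy.deepcopy(base_payload)  # leave the base payload untouched
        variant["composition"]["captions"]["preset"] = preset
        variants[preset] = variant
    return variants


# Trimmed base payload (hypothetical subset of the featured request).
base = {
    "version": "v1",
    "composition": {
        "captions": {"preset": "karaoke_yellow", "mode": "phrase_karaoke"}
    },
}

variants = preset_variants(base, PRESETS)
print(len(variants))  # prints 8
print(variants["bold_outline"]["composition"]["captions"]["preset"])  # prints bold_outline
```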

Quick preset picks

  • Fast default: tiktok_classic
  • Most social punch: bold_outline
  • Best active-word emphasis: karaoke_yellow
  • Editorial/luxury styles: typewriter, luxury_serif
  • Stylized/experimental: neon_glow, handwriting
  • High-contrast bubble look: soft_pill

Preset          Best for
TikTok Classic  clean social default readability
Bold Outline    high-impact, thick-stroke emphasis
Karaoke Yellow  active-word karaoke progression
Neon Glow       stylized glow look for energetic edits
Pill Captions   dark rounded bubble with strong contrast
Typewriter      editorial mono-text presentation
Handwriting     informal creator voice style
Luxury Serif    premium editorial aesthetic