Video Reframe

Auto-crop horizontal video to vertical with AI active-speaker framing. TikTok-ready in one call — frames the speaker in every shot and cuts cleanly at every boundary.

Best for

  • Reframing 16:9 podcast / interview / documentary footage to 9:16 TikTok/Reels/Shorts
  • Tracking a specific subject (host, product, screen) when content has multiple focal points
  • Producing 1:1 Instagram squares from horizontal masters with active-speaker tracking
  • Closing the long-form-to-shorts loop when chained with video-trim and captions

When to use

  • After video-trim — trim picks the moment, reframe picks the framing
  • Before captions — reframe first so caption text fits the 9:16 canvas
  • Standalone for users with already-edited horizontal clips that need vertical versions

Tips

  • Supply a diarized transcript for interviews and panels — it drives per-shot active-speaker framing far more reliably than vision alone
  • Shots with no clear single subject (screen-shares, wide establishing shots) letterbox by design — that is the safe choice, not a miss
  • There are no pan / zoom / smoothing knobs to tune — the camera director is deterministic lock-and-cut

Video Reframe — Horizontal to Vertical with Speaker Tracking

Most viral video lives at 9:16 (TikTok, Reels, Shorts) but most footage is shot 16:9. Video Reframe closes the gap: it finds the active speaker in every shot and frames them in a clean vertical crop — held perfectly still through the shot, cutting only where the source cuts. No manual editing, no camera knobs to tune.

How It Works

  1. Detect shot boundaries — FFmpeg scene-cut detector partitions the source so every framing decision is scoped to one continuous shot.
  2. Localize faces and identify the cast — a vision pass plus a CV face detector find and cluster faces across each shot's keyframes, giving every person a stable identity.
  3. Resolve the focal subject per shot — audio-first: a diarized transcript names the active speaker; otherwise a one-label-per-shot vision pass and the CV faces decide. The speaker gets the frame; an ambiguous shot letterboxes (shows everyone) rather than guessing.
  4. Plan the camera, deterministically — a pure planner turns the per-shot focal into a lock-and-cut plan: one static crop per shot, hard cuts at boundaries, sub-second shots merged into neighbours. No pans, no drift, no motion artifacts — the same inputs always produce the same plan.
  5. Render — FFmpeg applies the static per-shot crop (or a centered letterbox for whole-frame shots), keeping audio intact.
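The planning and rendering steps above can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the pipeline's actual code: the Shot type, the function names, the rule for which neighbour absorbs a sub-second shot, the even-width rounding, and the 1080x1920 output size are all hypothetical. The generated strings use FFmpeg's standard crop, scale, and pad filters, and the crop helper only handles targets narrower than the source.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Shot:
    start: float                # seconds
    end: float                  # seconds
    focal_x: Optional[float]    # focal centre as a fraction of source width; None = no clear subject


def merge_short_shots(shots: List[Shot], min_len: float = 1.0) -> List[Shot]:
    """Fold sub-second shots into a neighbour.

    Assumed rule: a short shot is absorbed by the previous shot and its own
    focal is discarded; a short leading shot is absorbed by the next one.
    """
    merged: List[Shot] = []
    for shot in shots:
        if merged and shot.end - shot.start < min_len:
            merged[-1].end = shot.end
        else:
            merged.append(Shot(shot.start, shot.end, shot.focal_x))
    if len(merged) > 1 and merged[0].end - merged[0].start < min_len:
        merged[1].start = merged[0].start
        merged.pop(0)
    return merged


def crop_rect(src_w: int, src_h: int, target_w: int, target_h: int,
              focal_x: float) -> Tuple[int, int, int, int]:
    """Full-height crop at the target aspect, centred on focal_x and clamped to the frame."""
    crop_w = min(src_w, src_h * target_w // target_h)
    crop_w -= crop_w % 2  # keep the width even, which most encoders expect
    x = round(focal_x * src_w - crop_w / 2)
    x = max(0, min(x, src_w - crop_w))
    return crop_w, src_h, x, 0


def plan_filters(shots: List[Shot], src_w: int, src_h: int,
                 out_w: int = 1080, out_h: int = 1920) -> List[Tuple[float, float, str]]:
    """One static FFmpeg filter chain per shot: a crop, or a centred letterbox."""
    plan = []
    for shot in merge_short_shots(shots):
        if shot.focal_x is None:
            # No clear subject: show the whole frame, padded to the output canvas.
            vf = f"scale={out_w}:-2,pad={out_w}:{out_h}:(ow-iw)/2:(oh-ih)/2"
        else:
            w, h, x, y = crop_rect(src_w, src_h, out_w, out_h, shot.focal_x)
            vf = f"crop={w}:{h}:{x}:{y},scale={out_w}:{out_h}"
        plan.append((shot.start, shot.end, vf))
    return plan
```

Because every function here is pure, the same shot list always yields the same plan, which is the property the lock-and-cut design relies on.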

Supply a transcript for multi-speaker tracking

For interviews and panels, pass a diarized transcript (SRT) covering the same time range as the video. Its speaker turns drive the focal choice, so a two-shot follows whoever is actually talking — not just the largest face on screen. Without a transcript the planner still works, falling back to the per-shot vision label and CV face positions.
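As a rough illustration of how speaker turns could drive the per-shot focal choice, the sketch below parses a diarized SRT and picks the dominant speaker inside a shot's time range. Several things here are assumptions rather than documented behaviour: the "SPEAKER: text" cue format, the function names, and the 60% dominance threshold below which a shot is treated as ambiguous.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

TIME_RE = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")

Turn = Tuple[float, float, str]  # (start seconds, end seconds, speaker label)


def _seconds(parts: Tuple[str, str, str, str]) -> float:
    h, m, s, ms = (int(p) for p in parts)
    return h * 3600 + m * 60 + s + ms / 1000


def parse_diarized_srt(text: str) -> List[Turn]:
    """Parse SRT cues whose first text line looks like 'SPEAKER: words' (assumed format)."""
    turns: List[Turn] = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        timing = next((i for i, ln in enumerate(lines) if "-->" in ln), None)
        if timing is None or timing + 1 >= len(lines):
            continue
        stamps = TIME_RE.findall(lines[timing])
        if len(stamps) < 2:
            continue
        speaker, sep, _ = lines[timing + 1].partition(":")
        if sep:  # cues without a speaker prefix are ignored
            turns.append((_seconds(stamps[0]), _seconds(stamps[1]), speaker.strip()))
    return turns


def active_speaker(turns: List[Turn], shot_start: float, shot_end: float,
                   min_share: float = 0.6) -> Optional[str]:
    """Return the speaker with most talk time in the shot, or None if no one clearly dominates."""
    talk: Dict[str, float] = defaultdict(float)
    for start, end, speaker in turns:
        overlap = min(end, shot_end) - max(start, shot_start)
        if overlap > 0:
            talk[speaker] += overlap
    total = sum(talk.values())
    if total == 0:
        return None  # no speech in this shot; fall back to vision and face cues
    speaker, seconds = max(talk.items(), key=lambda item: item[1])
    return speaker if seconds / total >= min_share else None
```

Returning None for a shot without a dominant speaker mirrors the letterbox-rather-than-guess behaviour: the caller can fall back to the vision label and face positions, or show the whole frame.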

Frequently Asked Questions

What aspect ratios are supported?
9:16 (TikTok/Reels/Shorts), 1:1 (Instagram square), 4:5 (Instagram portrait), and 16:9 (passthrough, no reframing applied).

What if my source is already vertical?
The pipeline detects this and returns the source as-is with meta.skipped=already_target_aspect.

How does it pick who to follow?
Audio-first. If you supply a diarized transcript, it frames the active speaker per shot; otherwise a per-shot vision label plus a CV face detector decide. Shots with no clear single subject (screen-shares, wide establishing shots) letterbox to show the whole frame rather than guessing.

Does it ever pan or zoom around?
No. Each shot gets one static crop held perfectly still; the camera only ever cuts at a shot boundary. This makes the jittery-pan / hunting-camera artifact class structurally impossible.

What does it cost?
2 credits for clips up to 1 minute, 3.5 credits for clips of 1 to 2 minutes, and 5 credits for anything longer.

Can I chain it with video-trim?
Yes. The typical chain is video-trim (length) → video-reframe (aspect) → captions (burn-in). Most users will want all three for a finished short.
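For budgeting before a run, the pricing tiers and the already-at-target check can be approximated locally. The sketch below is a convenience estimate only: the function names, the inclusive tier boundaries at exactly 60 and 120 seconds, and the 1% aspect tolerance are assumptions, and the pipeline's own accounting is authoritative.

```python
# Supported target aspect ratios, as (width, height) parts.
TARGET_ASPECTS = {
    "9:16": (9, 16),
    "1:1": (1, 1),
    "4:5": (4, 5),
    "16:9": (16, 9),
}


def reframe_credits(duration_s: float) -> float:
    """Estimate credits from clip duration (tier boundaries assumed inclusive)."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    if duration_s <= 60:
        return 2.0
    if duration_s <= 120:
        return 3.5
    return 5.0


def already_target_aspect(src_w: int, src_h: int, target: str, tol: float = 0.01) -> bool:
    """True when the source aspect is within a small tolerance of the target (tolerance assumed)."""
    target_w, target_h = TARGET_ASPECTS[target]
    return abs(src_w / src_h - target_w / target_h) <= tol
```

For example, a 90-second 1920x1080 clip headed for 9:16 would be estimated at 3.5 credits, while a 1080x1920 source would be flagged as already matching the target.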
