Video Reframe
Auto-crop horizontal video to vertical with AI active-speaker framing. One call produces a TikTok-ready clip that frames the speaker in every shot and cuts cleanly at every shot boundary.
Best for
- Reframing 16:9 podcast, interview, or documentary footage to 9:16 for TikTok, Reels, and Shorts
- Tracking a specific subject (host, product, screen) when content has multiple focal points
- Producing 1:1 Instagram squares from horizontal masters with active-speaker tracking
- Closing the long-form-to-shorts loop when chained with video-trim and captions
When to use
- After video-trim: trim picks the moment, reframe picks the framing
- Before captions: reframe first so caption text fits the 9:16 canvas
- Standalone, for users with already-edited horizontal clips that need vertical versions
Tips
- Supply a diarized transcript for interviews and panels. It drives per-shot active-speaker framing far more reliably than vision alone.
- Shots with no clear single subject (screen-shares, wide establishing shots) letterbox by design. That is the safe choice, not a miss.
- There are no pan, zoom, or smoothing knobs to tune. The camera director follows a deterministic lock-and-cut strategy.
Video Reframe — Horizontal to Vertical with Speaker Tracking
Short-form platforms (TikTok, Reels, Shorts) favor 9:16, but most footage is shot in 16:9. Video Reframe closes the gap: it finds the active speaker in every shot and frames them in a clean vertical crop that holds still for the whole shot and cuts only where the source cuts. No manual editing, no camera knobs to tune.
How It Works
- Detect shot boundaries: an FFmpeg scene-cut detector partitions the source so every framing decision is scoped to one continuous shot.
- Localize faces and identify the cast: a vision pass plus a CV face detector find and cluster faces across each shot's keyframes, giving every person a stable identity.
- Resolve the focal subject per shot: the choice is audio-first. When a diarized transcript is supplied, it names the active speaker; otherwise a one-label-per-shot vision pass and the CV faces decide. The speaker gets the frame; an ambiguous shot letterboxes (shows everyone) rather than guessing.
- Plan the camera, deterministically: a pure planner turns the per-shot focal into a lock-and-cut plan with one static crop per shot, hard cuts at boundaries, and sub-second shots merged into neighbours. No pans, no drift, no motion artifacts; the same inputs always produce the same plan.
- Render: FFmpeg applies the static per-shot crop (or a centered letterbox for whole-frame shots), keeping audio intact.
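The planning and render steps above can be sketched roughly as follows. This is a minimal Python illustration, not the pipeline's actual code. The `Shot` and `PlannedShot` types, the normalized `focal_x` input, the helper names, and the 1080x1920 output size are all assumptions made for the example. Only `crop`, `scale`, and `pad` are real FFmpeg filter names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Crop = Tuple[int, int, int, int]  # (width, height, x, y) in source pixels


@dataclass
class Shot:  # hypothetical input type: one detected shot
    start: float              # seconds
    end: float                # seconds
    focal_x: Optional[float]  # subject centre, 0..1 across the frame; None = ambiguous


@dataclass
class PlannedShot:  # hypothetical output type: one static framing per shot
    start: float
    end: float
    crop: Optional[Crop]      # None = letterbox the whole frame


def merge_short_shots(shots: List[Shot], min_len: float = 1.0) -> List[Shot]:
    """Fold shots shorter than min_len into a neighbour (previous if any, else next)."""
    merged: List[Shot] = []
    for shot in shots:
        if merged and shot.end - shot.start < min_len:
            prev = merged[-1]
            merged[-1] = Shot(prev.start, shot.end, prev.focal_x)
        else:
            merged.append(shot)
    # A short opening shot has no previous neighbour, so fold it into the next one.
    if len(merged) > 1 and merged[0].end - merged[0].start < min_len:
        first, second = merged[0], merged[1]
        merged[:2] = [Shot(first.start, second.end, second.focal_x)]
    return merged


def crop_window(src_w: int, src_h: int, focal_x: float,
                aspect: Tuple[int, int] = (9, 16)) -> Crop:
    """Full-height crop of the target aspect, centred on focal_x and clamped to the frame."""
    # Round the width down to an even number, which most encoders expect.
    crop_w = min(src_w, int(src_h * aspect[0] / aspect[1]) // 2 * 2)
    x = round(focal_x * src_w - crop_w / 2)
    x = max(0, min(x, src_w - crop_w))
    return (crop_w, src_h, x, 0)


def plan(shots: List[Shot], src_w: int, src_h: int) -> List[PlannedShot]:
    """Pure function: one static crop (or letterbox) per merged shot, no state, no randomness."""
    planned = []
    for shot in merge_short_shots(shots):
        crop = None if shot.focal_x is None else crop_window(src_w, src_h, shot.focal_x)
        planned.append(PlannedShot(shot.start, shot.end, crop))
    return planned


def ffmpeg_filter(shot: PlannedShot, out_w: int = 1080, out_h: int = 1920) -> str:
    """Build an FFmpeg video filter string for one planned shot."""
    if shot.crop is None:
        # Letterbox: fit the width, then pad top and bottom to the output canvas.
        return f"scale={out_w}:-2,pad={out_w}:{out_h}:(ow-iw)/2:(oh-ih)/2"
    w, h, x, y = shot.crop
    return f"crop={w}:{h}:{x}:{y},scale={out_w}:{out_h}"
```

For a 1920x1080 source with the subject centred (`focal_x=0.5`), `crop_window` returns a 606x1080 window at x=657. A shot whose `focal_x` is `None` gets the letterbox filter instead. Because `plan` depends only on its arguments, calling it twice with the same shots returns equal plans, which is the property the lock-and-cut design relies on.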
Supply a transcript for multi-speaker tracking
For interviews and panels, pass a diarized transcript (SRT) covering the same time range as the video. Its speaker turns drive the focal choice, so a two-shot follows whoever is actually talking — not just the largest face on screen. Without a transcript the planner still works, falling back to the per-shot vision label and CV face positions.
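As a rough illustration of how speaker turns could map onto shots, the sketch below parses a diarized SRT and picks, for each shot, the speaker with the most talk time inside it. SRT has no standard field for speaker labels, so the example assumes each cue's text begins with a label such as `SPEAKER_00:`. That convention, the function names, and the "most overlap wins" rule are all assumptions for the example, not the pipeline's documented behaviour.

```python
import re
from collections import defaultdict
from typing import List, Optional, Tuple

Turn = Tuple[float, float, str]  # (start_s, end_s, speaker)

CUE_TIME = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)
SPEAKER = re.compile(r"^\s*([\w ]+):\s*")  # assumed label convention, e.g. "SPEAKER_00: ..."


def _seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_diarized_srt(text: str) -> List[Turn]:
    """Return one (start, end, speaker) turn per cue that carries a speaker label."""
    turns: List[Turn] = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            timing = CUE_TIME.search(line)
            if not timing:
                continue
            groups = timing.groups()
            label = SPEAKER.match(" ".join(lines[i + 1:]))
            if label:
                turns.append((_seconds(*groups[:4]), _seconds(*groups[4:]),
                              label.group(1).strip()))
            break
    return turns


def active_speaker_per_shot(turns: List[Turn],
                            shots: List[Tuple[float, float]]) -> List[Optional[str]]:
    """For each (start, end) shot, the speaker with the most overlapping speech, else None."""
    result: List[Optional[str]] = []
    for shot_start, shot_end in shots:
        totals = defaultdict(float)
        for start, end, speaker in turns:
            overlap = min(end, shot_end) - max(start, shot_start)
            if overlap > 0:
                totals[speaker] += overlap
        # Iterate names in sorted order so ties resolve the same way every run.
        result.append(max(sorted(totals), key=totals.get) if totals else None)
    return result


SAMPLE_SRT = """1
00:00:00,000 --> 00:00:03,000
SPEAKER_00: Welcome back to the show.

2
00:00:03,000 --> 00:00:05,500
SPEAKER_01: Thanks for having me.

3
00:00:05,500 --> 00:00:09,000
SPEAKER_01: So, about the launch.
"""

speakers = active_speaker_per_shot(
    parse_diarized_srt(SAMPLE_SRT), [(0.0, 4.0), (4.0, 9.0), (9.0, 12.0)]
)
print(speakers)  # ['SPEAKER_00', 'SPEAKER_01', None]
```

The third shot has no speech in the transcript, so it resolves to `None`. In that case a caller would fall back to the vision label and face positions described above.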