claude-real-video Review (2026): Honest Verdict on the Tool That Lets Claude Watch a Video

Honest scoring on scene-frame extraction, transcription, local privacy, the terminal barrier, and who should actually use it.

Last verified · 2026-07-02 · by Moe Ameen

The verdict

4.0 / 5

claude-real-video is a sharp, well-scoped open-source tool that does exactly what it claims: it lets Claude — or any LLM — watch a video by turning it into scene-change frames plus a transcript, all locally and for free. The scene-aware frame selection and privacy-first design are genuinely good. But it is a terminal utility and a perception layer, not a content tool — it makes a video legible to a model and stops there, so score it as an excellent input step, not a content system.

claude-real-video is a free, open-source command-line tool built around a real limitation: Claude cannot watch a video. It reasons over text and still images, not raw video files or audio waveforms. So the tool does the perception work — it detects scene changes, captures a frame at each meaningful transition, deduplicates the near-identical ones, transcribes the audio, and writes a manifest — and hands that to Claude, which then answers grounded in what the video actually shows and says.

This review is about whether the tool earns its place in your workflow and who it actually fits. Disclosure upfront: I run Kompozy, a content engine. Kompozy is not a video-analysis tool and is not trying to be an alternative to letting Claude watch a video, so the two do not compete and I have no reason to talk this one down. It is good at its narrow job. The honest read is that claude-real-video nails the perception step and deliberately does nothing beyond it.

Two facts frame the whole verdict: it is a terminal tool with real dependencies (Python, ffmpeg, yt-dlp, optionally Whisper), and it is an input layer that produces frames and text for a model, not posts for an audience. Everything below is scored against the tool's public repository state as of 2026-07-02.
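Because the dependency list is the first hurdle, a quick preflight check can save a failed run. The following is a minimal sketch, not part of claude-real-video itself: the helper name `missing_tools` is hypothetical, and it assumes the dependencies are installed under their usual executable names (`ffmpeg`, `yt-dlp`). Whisper is a Python package rather than a standalone binary, so it is left out here.

```python
import shutil
from typing import Callable, Iterable, Optional

# Executables the review lists as dependencies. Checking PATH like this is
# an illustration only; it is an assumption, not how crv verifies its setup.
REQUIRED_TOOLS = ("ffmpeg", "yt-dlp")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the tools that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required tools found on PATH.")
```

The `which` parameter exists only so the lookup can be swapped out when testing; in normal use the default `shutil.which` is enough.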

What claude-real-video is

claude-real-video (run as the command crv) is a local, MIT-licensed Python tool that gives Claude or any LLM the ability to watch a video. It uses ffmpeg to detect scene changes and extract a frame at each visual transition rather than sampling at a fixed interval, then runs a deduplication pass against a sliding window to drop redundant frames. For audio it prefers an existing subtitle track (SRT/VTT) and falls back to OpenAI's Whisper, with an option to preserve the full soundtrack for audio-capable models. It can pull a video from a YouTube, Instagram, or TikTok URL via yt-dlp or work on a local file, and it caps how many frames reach the model (150 by default), with flags to tune scene sensitivity, dedup, language, and audio.

The output is a folder of JPEG frames, a transcript, and a MANIFEST file. Everything runs on your machine with nothing uploaded.

It is a perception layer, not an interpretation layer and not a content workflow. It does not generate captions, cut clips, reframe video, write posts or blogs, produce images or carousels, enforce a brand voice, or publish anything. It prepares the inputs; the understanding is Claude's, and any content you might want to make from the video is a separate job.
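To make the scene-change idea concrete, here is a rough sketch of how scene-based extraction can be done with ffmpeg's `select` filter and its `scene` score, plus a simple frame cap. This is an illustration of the general technique, not the tool's actual code. The function names, the 0.3 threshold, the output filename pattern, and the even-spacing cap strategy are all assumptions; only the 150-frame default comes from the review.

```python
from pathlib import Path


def scene_frame_command(video: str, out_dir: str, threshold: float = 0.3) -> list[str]:
    """Build an ffmpeg command that writes one JPEG per detected scene change."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be between 0 and 1")
    pattern = str(Path(out_dir) / "frame_%05d.jpg")
    return [
        "ffmpeg",
        "-i", video,
        # Keep only frames whose scene-change score exceeds the threshold.
        "-vf", f"select='gt(scene,{threshold})'",
        # Variable frame rate output, so skipped frames are not duplicated.
        "-vsync", "vfr",
        pattern,
    ]


def cap_frames(frames: list[str], limit: int = 150) -> list[str]:
    """Thin an ordered frame list to at most `limit` items, evenly spaced.

    Even spacing is an assumed strategy chosen for this sketch.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    if len(frames) <= limit:
        return list(frames)
    step = len(frames) / limit
    return [frames[int(i * step)] for i in range(limit)]
```

The command list can be passed to `subprocess.run(cmd, check=True)` once the output directory exists. Newer ffmpeg releases prefer `-fps_mode vfr` over `-vsync vfr`, so the flag may need adjusting for your build.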

Who claude-real-video is for

The clearest fit is a developer or technical creator who wants Claude to actually analyze a specific video — mining a competitor's clip for structure and hooks, interrogating a long course or webinar, checking what is really on screen versus what a title claims, or doing any of this on footage that cannot leave the machine. The local, no-cloud design makes it a strong choice for sensitive or unreleased video, and the CLI and flags reward people who like scriptable, repeatable control. Where it fits poorly: anyone who is not comfortable in a terminal, and anyone whose actual goal is producing content. If you want captioned shorts, carousels, a blog, or scheduled posts, this tool does none of that — it hands you an understanding, not an output.

Scoring breakdown

Dimension	Score	Why
Scene-aware frame selection	4.5 / 5	Detecting scene changes and capturing a frame per transition is the right approach, and the dedup pass keeps the model from drowning in near-identical stills.
Transcription & audio handling	4.0 / 5	Preferring existing subtitles before falling back to Whisper is smart, and optional soundtrack preservation covers audio-capable models.
Local privacy / no-cloud processing	4.5 / 5	Everything runs on your machine with nothing uploaded — a real advantage for sensitive footage.
Source support (URLs + local files)	4.0 / 5	yt-dlp brings in YouTube, Instagram, and TikTok links alongside local files, covering most real inputs.
Configurability	4.0 / 5	Flags for scene sensitivity, frame ceiling, dedup window, language, and audio give useful control with sensible defaults.
Output usefulness for an LLM	4.0 / 5	Frames plus transcript plus a manifest is a clean, model-ready package; interpretation quality still depends on the LLM that reads it.
Ease of use / accessibility	2.5 / 5	A terminal tool with Python, ffmpeg, and yt-dlp dependencies — powerful for developers, a wall for everyone else.
Value (free / open-source)	4.5 / 5	MIT-licensed and free to run yourself; you only pay for whatever model reads the output.
Content creation & publishing	1.0 / 5	None, by design. It analyzes video and produces nothing you can post — captioning, generation, and publishing are all out of scope.
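The deduplication pass credited in the scoring table can be pictured as comparing each new frame against a short window of recently kept frames. The sketch below uses an average hash over small grayscale thumbnails and a Hamming-distance threshold. The hashing method, the window size, and the distance threshold are all assumptions made for illustration; the review does not say which similarity measure the tool actually uses.

```python
from collections import deque
from typing import Sequence


def average_hash(pixels: Sequence[int]) -> int:
    """Hash a small grayscale thumbnail: one bit per pixel, set when above the mean."""
    mean = sum(pixels) / len(pixels)
    bits = 0
    for value in pixels:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two hashes."""
    return bin(a ^ b).count("1")


def dedupe(
    thumbnails: Sequence[Sequence[int]],
    window: int = 5,
    max_distance: int = 2,
) -> list[int]:
    """Return the indices of frames to keep.

    A frame is dropped when its hash is within `max_distance` bits of any
    of the last `window` frames that were kept.
    """
    recent = deque(maxlen=window)
    kept = []
    for index, pixels in enumerate(thumbnails):
        frame_hash = average_hash(pixels)
        if any(hamming(frame_hash, seen) <= max_distance for seen in recent):
            continue
        recent.append(frame_hash)
        kept.append(index)
    return kept
```

The window matters for videos that cut back and forth between two shots: a larger window drops the repeated shot, while a window of one only compares against the most recently kept frame.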

Pros and cons

Pros

Actually lets Claude or any LLM watch a video — the perception step the model cannot do alone
Scene-change frame selection with deduplication, so the model sees meaningful shots, not redundant stills
Runs entirely locally with nothing uploaded — strong privacy for sensitive footage
Free and open-source under an MIT license, with no account or subscription
Handles YouTube, Instagram, and TikTok URLs plus local files via yt-dlp
Sensible defaults with flags to tune scene sensitivity, frame count, dedup, language, and audio
Prefers existing subtitles before re-transcribing, which is faster and often more accurate

Cons

Terminal-only: Python, ffmpeg, and yt-dlp stand between you and a result
Produces nothing publishable — frames and a transcript are for a model, not an audience
No clip export, captioning, or per-platform reframing
No generation of images, carousels, blogs, or newsletters
No publishing, scheduling, or brand-voice layer
Interpretation quality still rests on whatever LLM reads the output
A developer utility, not a polished app or a Claude Code plugin

Pricing analysis

There is not much to analyze on price, and that is a point in the tool's favor: claude-real-video is free and open-source under an MIT license. You install it, you run it, and the only money involved is whatever a model costs when it reads the output — Whisper transcription runs locally, and the frames and transcript can go to Claude or any LLM you already pay for. For a utility this focused, free is the right price and there is no upsell lurking.

The honest cost is not dollars; it is setup and scope. You pay in dependencies and terminal time to get it running, and you pay in the fact that it stops at analysis. The tool gives you an understanding of a video; turning that understanding into content (clips, captions, posts, a blog) is work it does not do and does not charge for, because it is not that kind of product.

Compared to paid video-understanding APIs or content platforms, claude-real-video wins outright on cost for its narrow job. Just budget realistically: if the outcome you want is published content, the free analysis is one early step, and the tools that generate and distribute that content sit downstream and are where the real spend lives.

Use-case fit

Use case	Fit	Why
Getting Claude to analyze a specific video	Strong	Scene frames plus a transcript are exactly the inputs an LLM needs to answer grounded, in-video questions.
Private, local analysis of sensitive footage	Strong	Everything processes on your machine with no cloud upload, so unreleased or confidential video never leaves the device.
Mining a competitor video for hooks and structure	Strong	The transcript and frames make it easy to see how a video is built and pull out what works.
Scriptable, repeatable video-to-context in a dev pipeline	OK	The CLI and flags suit automation, though you own the integration and the model call.
Non-technical creators who want quick insight	Weak	The terminal, Python, and ffmpeg dependencies are a real barrier for anyone who does not code.
Turning a video into captioned shorts and posts	Weak	It generates nothing — no clip export, captioning, reframing, or publishing exists in the tool.
Producing a blog, newsletter, or carousel from a video	Weak	Written and visual content generation is entirely out of scope; the output is model-ready frames and text only.
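For the scriptable pipeline case in the table above, the integration you own is mostly packaging: turning the frame folder and transcript into a request a model can read. The sketch below builds a list of content blocks in the base64 image shape used by Anthropic's Messages API, without making any network call. The function name `build_content` and the prompt wording are hypothetical, and the API limits on image count and size should be checked against current documentation before sending a full frame set.

```python
import base64
from pathlib import Path


def build_content(frame_paths: list[str], transcript: str, question: str) -> list[dict]:
    """Assemble image and text content blocks for one user message."""
    blocks = []
    for path in frame_paths:
        # Each JPEG frame is sent inline as base64 text.
        data = base64.standard_b64encode(Path(path).read_bytes()).decode("ascii")
        blocks.append({
            "type": "image",
            "source": {"type": "base64", "media_type": "image/jpeg", "data": data},
        })
    # The transcript and the question travel together in a final text block.
    blocks.append({
        "type": "text",
        "text": f"Transcript:\n{transcript}\n\nQuestion: {question}",
    })
    return blocks
```

The returned list would go in the `content` field of a single user message. Keeping the frames in chronological order before the text lets the model relate what it sees to what the transcript says.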

Alternatives worth considering

Other open-source Claude video plugins (e.g. claude-video, claude-video-vision) — similar perception-layer approach with different packaging, some built as Claude Code plugins
NotebookLM — best for source-grounded understanding of documents and media with a friendlier, no-terminal interface
TwelveLabs — best for production video understanding and search via a managed API rather than a local CLI
Kompozy — best if the real goal is turning a video into published content, not analyzing it — generation plus multi-platform publishing

How Kompozy compares

The fair way to place these is on opposite sides of one workflow. claude-real-video is the read layer: it makes a video legible to Claude so you can understand what is in it. Kompozy is the create-and-ship layer: it generates finished content and publishes it. They do not compete — a model watching a video and an engine producing posts are different jobs — and honestly, they chain well. Analyze a video with claude-real-video to decide what is worth making, then hand the source to Kompozy to make it.

Where claude-real-video stops is exactly where Kompozy starts. Point Kompozy at the same long-form video and Clipped Shorts detects the strongest moments and cuts them to vertical with branded captions; the session also becomes a recap blog, a carousel, quote cards, native text posts, and a newsletter, all governed by a Persona Brief so the voice stays consistent, then Autopilot schedules the set across nine platforms from one queue. The honest framing: claude-real-video is the best free way to let Claude watch a video, and it should stay exactly that focused. If what you actually want is content out the other end, that is a separate engine — and it is the one worth pairing with this.

Frequently asked questions

Is claude-real-video worth it in 2026?

If you want Claude or another LLM to genuinely watch and analyze a video — locally, privately, and for free — yes. The scene-aware frame selection, transcription, and manifest are well-executed and the MIT license makes it low-risk. It is less worth it if you are not comfortable in a terminal, or if your real goal is producing content, because it generates nothing you can post.

What does claude-real-video actually do?

It turns a video into inputs an LLM can read: it detects scene changes and extracts a frame at each meaningful transition, deduplicates near-identical frames, transcribes the audio with Whisper (or uses existing subtitles), optionally preserves the soundtrack, and writes a MANIFEST that ties it together. Claude then answers grounded in what the video shows and says.
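The "existing subtitles first, Whisper second" behavior can be sketched as a simple lookup followed by a cleanup step. This example assumes subtitles live in a sidecar file next to the video with the same base name, which may not match how the tool locates subtitle tracks. The function names are hypothetical, and the parser is deliberately simple: it would also drop a spoken line that consists only of digits.

```python
import re
from pathlib import Path
from typing import Optional

SUBTITLE_SUFFIXES = (".srt", ".vtt")
# Matches SRT ("00:00:01,000 --> ...") and WebVTT ("00:01.000 --> ...") timing lines.
TIMING_LINE = re.compile(r"^\d{2}:\d{2}(:\d{2})?[.,]\d{3}\s+-->\s+")


def find_subtitles(video_path: str) -> Optional[Path]:
    """Return a sidecar subtitle file next to the video, if one exists."""
    video = Path(video_path)
    for suffix in SUBTITLE_SUFFIXES:
        candidate = video.with_suffix(suffix)
        if candidate.is_file():
            return candidate
    return None


def subtitles_to_text(raw: str) -> str:
    """Strip cue numbers, timing lines, and the WEBVTT header; keep spoken text."""
    lines = []
    for line in raw.splitlines():
        stripped = line.strip()
        if not stripped or stripped.isdigit() or stripped.startswith("WEBVTT"):
            continue
        if TIMING_LINE.match(stripped):
            continue
        lines.append(stripped)
    return " ".join(lines)


def transcript_source(video_path: str) -> str:
    """Name the source to use: 'subtitles' when a sidecar exists, else 'whisper'."""
    return "subtitles" if find_subtitles(video_path) else "whisper"
```

Preferring subtitles avoids a transcription pass entirely when a human-made track already exists, which is where the speed and accuracy benefit comes from.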

Do I need to code to use it?

Effectively yes. It is a Python command-line tool that depends on ffmpeg and yt-dlp and is driven by flags, so you should be comfortable in a terminal. There is no graphical app or one-click install, and it is a developer utility rather than a Claude Code plugin.

Is it free, and is my video private?

It is free and open-source under an MIT license, and it processes everything locally with nothing uploaded to a cloud service — so sensitive or unreleased footage stays on your machine. You only pay for whatever model reads the output.

How is it different from NotebookLM or a video-understanding API?

claude-real-video is a local CLI that prepares frames and a transcript for any LLM you choose. NotebookLM is a hosted, no-terminal tool for source-grounded understanding, and services like TwelveLabs offer managed video-understanding APIs. claude-real-video trades polish for locality, control, and being free.

Can it clip my video or post it to social?

No. It is a perception layer — it makes a video readable to a model and stops there. Cutting clips, adding captions, reframing for vertical, and publishing to platforms are all out of scope. For that, bring the video into a content engine like Kompozy.

What is the best way to turn a video I analyzed into content?

Use claude-real-video to decide what is worth making, then hand the source to a generation-and-publishing engine. Kompozy cuts the best moments into captioned shorts and turns the same session into a carousel, quote cards, a blog, and a newsletter, then publishes them across nine platforms from one queue.

What are its main limitations?

It is terminal-only with real dependencies, it produces nothing publishable, it has no captioning, clipping, or reframing, and interpretation quality still depends on whatever LLM reads its output. Those are scope choices, not bugs — it is an input tool by design.
