claude-real-video review 2026. Honest scoring on scene-frame extraction, transcription, local privacy, the terminal barrier, and who should actually use it.
claude-real-video is a sharp, well-scoped open-source tool that does exactly what it claims: it lets Claude — or any LLM — watch a video by turning it into scene-change frames plus a transcript, all locally and for free. The scene-aware frame selection and privacy-first design are genuinely good. But it is a terminal utility and a perception layer, not a content tool — it makes a video legible to a model and stops there, so score it as an excellent input step, not a content system.
claude-real-video is a free, open-source command-line tool built around a real limitation: Claude cannot watch a video. It reasons over text and still images, not raw video files or audio waveforms. So the tool does the perception work — it detects scene changes, captures a frame at each meaningful transition, deduplicates the near-identical ones, transcribes the audio, and writes a manifest — and hands that to Claude, which then answers grounded in what the video actually shows and says.
This review is about whether the tool earns its place in your workflow and who it actually fits. Disclosure upfront: I run Kompozy, a content engine. Kompozy is not a video-analysis tool and is not trying to be an alternative to letting Claude watch a video, so I have no reason to talk this tool down, and it is good at its narrow job. The honest read is that claude-real-video nails the perception step and deliberately does nothing beyond it.
Two facts frame the whole verdict: it is a terminal tool with real dependencies (Python, ffmpeg, yt-dlp, optionally Whisper), and it is an input layer that produces frames and text for a model, not posts for an audience. Everything below is scored against the tool's public repository state as of 2026-07-02.
claude-real-video (run as the command crv) is a local, MIT-licensed Python tool that gives Claude or any LLM the ability to watch a video. It uses ffmpeg to detect scene changes and extract a frame at each visual transition rather than sampling at a fixed interval, then runs a deduplication pass against a sliding window to drop redundant frames. For audio it prefers an existing subtitle track (SRT/VTT) and falls back to OpenAI's Whisper, with an option to preserve the full soundtrack for audio-capable models.
It can pull a video from a YouTube, Instagram, or TikTok URL via yt-dlp or work on a local file, and it caps how many frames reach the model (150 by default), with flags to tune scene sensitivity, dedup, language, and audio. The output is a folder of JPEG frames, a transcript, and a MANIFEST file. Everything runs on your machine with nothing uploaded.
It is a perception layer, not an interpretation layer and not a content workflow. It does not generate captions, cut clips, reframe video, write posts or blogs, produce images or carousels, enforce a brand voice, or publish anything. It prepares the inputs; the understanding is Claude's, and any content you might want to make from the video is a separate job.
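To make the scene-change step concrete, here is a minimal sketch of how a frame-per-transition extraction can be expressed with ffmpeg's `select` filter and its `scene` score. This is my illustration, not the tool's actual invocation: the function name `scene_frame_command`, the 0.3 threshold, the JPEG quality setting, and the output naming pattern are all assumptions, and crv's real defaults and flags may differ.

```python
def scene_frame_command(video_path: str, out_dir: str, threshold: float = 0.3) -> list[str]:
    """Build an ffmpeg argument list that writes one JPEG per detected scene change.

    Hypothetical helper for illustration; not part of claude-real-video.
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return [
        "ffmpeg",
        "-i", video_path,
        # Keep only frames whose scene-change score exceeds the threshold.
        "-vf", f"select='gt(scene,{threshold})'",
        # Variable frame rate output, so skipped frames are not duplicated
        # to fill the gaps (older ffmpeg builds spell this "-vsync vfr").
        "-fps_mode", "vfr",
        # JPEG quality: lower numbers mean higher quality.
        "-q:v", "2",
        f"{out_dir}/frame_%04d.jpg",
    ]


cmd = scene_frame_command("talk.mp4", "frames", threshold=0.3)
print(cmd)
```

Raising the threshold keeps only hard cuts; lowering it captures subtler transitions at the cost of more frames. To run the command, pass the list to `subprocess.run(cmd, check=True)` after creating the output directory, since ffmpeg will not create it for you.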
The clearest fit is a developer or technical creator who wants Claude to actually analyze a specific video — mining a competitor's clip for structure and hooks, interrogating a long course or webinar, checking what is really on screen versus what a title claims, or doing any of this on footage that cannot leave the machine. The local, no-cloud design makes it a strong choice for sensitive or unreleased video, and the CLI and flags reward people who like scriptable, repeatable control. Where it fits poorly: anyone who is not comfortable in a terminal, and anyone whose actual goal is producing content. If you want captioned shorts, carousels, a blog, or scheduled posts, this tool does none of that — it hands you an understanding, not an output.
| Dimension | Score | Why |
|---|---|---|
| Scene-aware frame selection | 4.5 / 5 | Detecting scene changes and capturing a frame per transition is the right approach, and the dedup pass keeps the model from drowning in near-identical stills. |
| Transcription & audio handling | 4.0 / 5 | Preferring existing subtitles before falling back to Whisper is smart, and optional soundtrack preservation covers audio-capable models. |
| Local privacy / no-cloud processing | 4.5 / 5 | Everything runs on your machine with nothing uploaded — a real advantage for sensitive footage. |
| Source support (URLs + local files) | 4.0 / 5 | yt-dlp brings in YouTube, Instagram, and TikTok links alongside local files, covering most real inputs. |
| Configurability | 4.0 / 5 | Flags for scene sensitivity, frame ceiling, dedup window, language, and audio give useful control with sensible defaults. |
| Output usefulness for an LLM | 4.0 / 5 | Frames plus transcript plus a manifest is a clean, model-ready package; interpretation quality still depends on the LLM that reads it. |
| Ease of use / accessibility | 2.5 / 5 | A terminal tool with Python, ffmpeg, and yt-dlp dependencies — powerful for developers, a wall for everyone else. |
| Value (free / open-source) | 4.5 / 5 | MIT-licensed and free to run yourself; you only pay for whatever model reads the output. |
| Content creation & publishing | 1.0 / 5 | None, by design. It analyzes video and produces nothing you can post — captioning, generation, and publishing are all out of scope. |
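The dedup pass and frame ceiling scored above can be sketched in a few lines. This is one illustrative reading of a sliding-window dedup, not crv's implementation: the average-hash comparison, the function names, the window size, and the distance threshold are all assumptions, and only the default cap of 150 comes from the tool's description. Thumbnails are plain lists of grayscale values so the sketch runs without an image library.

```python
from collections import deque


def average_hash(pixels: list[int]) -> int:
    """Pack a grayscale thumbnail into bits: 1 where a pixel is brighter than the mean."""
    mean = sum(pixels) / len(pixels)
    bits = 0
    for value in pixels:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


def dedupe(hashes: list[int], window: int = 5, max_distance: int = 2) -> list[int]:
    """Return indices of frames that differ from every recently kept frame."""
    recent = deque(maxlen=window)  # hashes of the last `window` kept frames
    kept = []
    for index, frame_hash in enumerate(hashes):
        if all(hamming(frame_hash, prev) > max_distance for prev in recent):
            kept.append(index)
            recent.append(frame_hash)
    return kept


def cap_evenly(indices: list[int], limit: int = 150) -> list[int]:
    """Thin a list down to at most `limit` entries, spread evenly across it."""
    if len(indices) <= limit:
        return indices
    step = len(indices) / limit
    return [indices[int(i * step)] for i in range(limit)]


# Four tiny 2x2 "thumbnails": frame 1 nearly repeats frame 0, frame 3 repeats it exactly.
frames = [
    [10, 10, 200, 200],
    [12, 11, 198, 205],
    [200, 200, 10, 10],
    [10, 10, 200, 200],
]
hashes = [average_hash(f) for f in frames]
kept = dedupe(hashes, window=2, max_distance=1)
print(kept)  # [0, 2]
```

The window size is the interesting knob. With `window=2`, frame 3 is dropped because frame 0 is still in the window; shrink the window to 1 and frame 3 is kept again, since only the most recent kept frame is compared. A wider window removes more repeats (a talking head the video keeps cutting back to) but costs more comparisons per frame.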
There is not much to analyze on price, and that is a point in the tool's favor: claude-real-video is free and open-source under an MIT license. You install it, you run it, and the only money involved is whatever a model costs when it reads the output — Whisper transcription runs locally, and the frames and transcript can go to Claude or any LLM you already pay for. For a utility this focused, free is the right price and there is no upsell lurking.
The honest cost is not dollars but setup and scope. You pay in dependencies and terminal time to get it running, and you accept that it stops at analysis. The tool gives you an understanding of a video; turning that understanding into content (clips, captions, posts, a blog) is work it does not do, because it is not that kind of product.
Compared to paid video-understanding APIs or content platforms, claude-real-video wins outright on cost for its narrow job. Just budget realistically: if the outcome you want is published content, the free analysis is one early step, and the tools that generate and distribute that content sit downstream and are where the real spend lives.
| Use case | Fit | Why |
|---|---|---|
| Getting Claude to analyze a specific video | Strong | Scene frames plus a transcript are exactly the inputs an LLM needs to answer grounded, in-video questions. |
| Private, local analysis of sensitive footage | Strong | Everything processes on your machine with no cloud upload, so unreleased or confidential video never leaves the device. |
| Mining a competitor video for hooks and structure | Strong | The transcript and frames make it easy to see how a video is built and pull out what works. |
| Scriptable, repeatable video-to-context in a dev pipeline | OK | The CLI and flags suit automation, though you own the integration and the model call. |
| Non-technical creators who want quick insight | Weak | The terminal, Python, and ffmpeg dependencies are a real barrier for anyone who does not code. |
| Turning a video into captioned shorts and posts | Weak | It generates nothing — no clip export, captioning, reframing, or publishing exists in the tool. |
| Producing a blog, newsletter, or carousel from a video | Weak | Written and visual content generation is entirely out of scope; the output is model-ready frames and text only. |
The fair way to place these is on opposite sides of one workflow. claude-real-video is the read layer: it makes a video legible to Claude so you can understand what is in it. Kompozy is the create-and-ship layer: it generates finished content and publishes it. They do not compete — a model watching a video and an engine producing posts are different jobs — and honestly, they chain well. Analyze a video with claude-real-video to decide what is worth making, then hand the source to Kompozy to make it.
Where claude-real-video stops is exactly where Kompozy starts. Point Kompozy at the same long-form video and Clipped Shorts detects the strongest moments and cuts them to vertical with branded captions; the session also becomes a recap blog, a carousel, quote cards, native text posts, and a newsletter, all governed by a Persona Brief so the voice stays consistent, then Autopilot schedules the set across nine platforms from one queue. The honest framing: claude-real-video is the best free way to let Claude watch a video, and it should stay exactly that focused. If what you actually want is content out the other end, that is a separate engine — and it is the one worth pairing with this.
claude-real-video is worth it if you want Claude or another LLM to genuinely watch and analyze a video locally, privately, and for free. The scene-aware frame selection, transcription, and manifest are well executed, and the MIT license makes it low-risk. It is less worth it if you are not comfortable in a terminal, or if your real goal is producing content, because it generates nothing you can post.
It turns a video into inputs an LLM can read: it detects scene changes and extracts a frame at each meaningful transition, deduplicates near-identical frames, transcribes the audio with Whisper (or uses existing subtitles), optionally preserves the soundtrack, and writes a MANIFEST that ties it together. Claude then answers grounded in what the video shows and says.
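The subtitles-first, Whisper-second order described above is easy to picture as a small decision. The sketch below is an assumption-heavy illustration rather than the tool's code: it guesses that subtitles sit next to the video as a same-named `.srt` or `.vtt` file, the function names are hypothetical, and the Whisper call itself is left out.

```python
from pathlib import Path
from typing import Optional

SUBTITLE_SUFFIXES = (".srt", ".vtt")


def find_subtitles(video: Path) -> Optional[Path]:
    """Look for a same-named subtitle file next to the video (assumed layout)."""
    for suffix in SUBTITLE_SUFFIXES:
        candidate = video.with_suffix(suffix)
        if candidate.is_file():
            return candidate
    return None


def choose_transcript_source(video: Path) -> str:
    """Prefer an existing subtitle track; otherwise fall back to speech-to-text."""
    return "subtitles" if find_subtitles(video) else "whisper"


def srt_to_text(srt: str) -> str:
    """Strip cue numbers and timestamps from SRT content, keeping the spoken lines.

    Simplification: a spoken line made only of digits would also be dropped.
    """
    lines = []
    for raw in srt.splitlines():
        line = raw.strip()
        if not line or line.isdigit() or "-->" in line:
            continue
        lines.append(line)
    return " ".join(lines)


sample = (
    "1\n00:00:01,000 --> 00:00:03,000\nHello there.\n\n"
    "2\n00:00:03,500 --> 00:00:05,000\nWelcome back.\n"
)
print(srt_to_text(sample))  # Hello there. Welcome back.
```

Checking for subtitles first is the cheaper path: a human-made or platform-supplied track needs no transcription pass at all, so speech-to-text only runs when nothing better exists.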
In practice, you do need to be technical to use it. It is a Python command-line tool that depends on ffmpeg and yt-dlp and is driven by flags, so you should be comfortable in a terminal. There is no graphical app or one-click install, and it is a developer utility rather than a Claude Code plugin.
It is free and open-source under an MIT license, and it processes everything locally with nothing uploaded to a cloud service — so sensitive or unreleased footage stays on your machine. You only pay for whatever model reads the output.
claude-real-video is a local CLI that prepares frames and a transcript for any LLM you choose. NotebookLM is a hosted, no-terminal tool for source-grounded understanding, and services like TwelveLabs offer managed video-understanding APIs. claude-real-video trades polish for locality, control, and being free.
It does not create or publish content. It is a perception layer: it makes a video readable to a model and stops there. Cutting clips, adding captions, reframing for vertical, and publishing to platforms are all out of scope. For that, bring the video into a content engine like Kompozy.
Use claude-real-video to decide what is worth making, then hand the source to a generation-and-publishing engine. Kompozy cuts the best moments into captioned shorts and turns the same session into a carousel, quote cards, a blog, and a newsletter, then publishes them across nine platforms from one queue.
It is terminal-only with real dependencies, it produces nothing publishable, it has no captioning, clipping, or reframing, and interpretation quality still depends on whatever LLM reads its output. Those are scope choices, not bugs — it is an input tool by design.