Google DeepMind's open-weight multimodal model family — reads images and audio, generates text, and runs fast and cheap.
Last verified · 2026-06-30 · by Moe Ameen
Gemma 4 is the open-weight model family from Google DeepMind, released under the Apache 2.0 license in 2026. It is the multimodal successor to the Gemma line, and the headline change is that it natively understands more than text: every variant takes both image and text input, the smaller models (E2B and E4B) also accept audio, and the models can reason over video frames at variable resolution. The output is text. That distinction matters — Gemma 4 is excellent at reading a screenshot, a chart, a scanned document, a form, or a UI state and writing about it, but it does not render images, video, or audio itself.
It ships in a spread of sizes rather than one flagship. There are compact, on-device-friendly variants (the "Effective" E2B and E4B models), a 26B mixture-of-experts model that activates only a fraction of its parameters per token, and a 31B dense model at the top. The smaller models carry a 128K-token context window and the larger ones go up to 256K, and the family is trained across 140+ languages, with function-calling and structured JSON output supported.
Google's pitch for Gemma 4 is intelligence-per-parameter: the models are tuned to punch above their size rather than chase raw scale, which is what makes them practical to run on modest hardware or serve cheaply at high throughput. At launch the 31B model ranked among the top open models on public text leaderboards, and the family builds on a Gemma ecosystem that has been downloaded hundreds of millions of times. Because it is open-weight, you can self-host it, fine-tune it, or reach it through hosted inference providers — and exact sizes, context limits, and benchmark standings shift as the family ships, so treat any single number as a snapshot.
Gemma 4's real edge for creators is that it *reads* — feed it a competitor's carousel screenshot, a performance dashboard, a chart from a report, or a frame from your own footage, and it writes back structured copy, hooks, or a content brief in seconds. Because it is small and cheap to run, you can do that at volume: turn a folder of screenshots into a week of post ideas in one pass. What it never does is produce the post. Gemma 4's output is plain text; it renders no carousel, no quote card, no avatar video, no scheduled feed.
That is exactly the seam Kompozy fills. Take Gemma 4's drafts and briefs into Kompozy and it produces the media the model can't — face-locked persona images, persona and avatar video, multi-slide carousels, quote graphics, infographics — then rewrites the copy in your own voice through a Persona Brief, burns in branded captions, reframes anything vertical per platform, and schedules and publishes across all nine destinations (Instagram, TikTok, YouTube, LinkedIn, X, Pinterest, Facebook, Threads, plus email and blog) from one queue. The clean division of labor: Gemma 4 is the fast, cheap pair of eyes that reads your inputs and drafts the words; Kompozy is the production line that turns those words into 18 finished, on-brand formats and ships them.
Gemma 4 is Google DeepMind's open-weight, multimodal model family, released under the Apache 2.0 license in 2026. Every variant accepts image and text input (the smaller E2B and E4B models also take audio), and the output is text. It comes in several sizes, from compact on-device models up to 26B mixture-of-experts and 31B dense variants.
No. Gemma 4 is multimodal on the input side — it can read images, audio, and video frames — but its output is text. It writes, reasons, transcribes, and describes; it does not render images, video, or audio. To turn its output into visual posts you pair it with a tool that generates media, such as Kompozy.
The weights are open under the Apache 2.0 license, so there is no fee for the model itself and you can self-host or fine-tune it commercially. Your real cost is the hardware or hosted-inference bill to run it. Exact sizes and limits change as the family ships, so check Google's model card for current details.
It spans compact on-device variants (the "Effective" E2B and E4B), a 26B mixture-of-experts model, and a 31B dense model at the top. Smaller models carry a 128K context window; larger ones go up to 256K. Treat specific figures as a snapshot — the lineup evolves.
Gemma 4 writes the text but publishes nothing and renders no media. Bring its drafts into Kompozy to generate carousels, persona/avatar video, quote cards, and images, rewrite the copy in your brand voice via the Persona Brief, and schedule and publish across Instagram, TikTok, LinkedIn, X, YouTube, and more from one queue.