Cerebras put Google's open Gemma 4 31B on its inference cloud at over 1,800 tokens per second, bringing image-and-text understanding to near-instant speeds.
2026-06-30 · by Moe Ameen
Cerebras announced on June 29, 2026 that it is serving Google DeepMind's open-weight Gemma 4 31B model on its inference cloud at over 1,800 tokens per second, a speed it frames as roughly 35 times that of a typical GPU endpoint, with the first answer token, reasoning included, returning in about 1.5 seconds. The release is in public preview on the Cerebras Inference Cloud. The notable part is not just the speed but the modality: Gemma 4 is multimodal on input, so this is fast inference over images as well as text.
Gemma 4 is Google DeepMind's open-weight model family, released under the Apache 2.0 license earlier in 2026. Every variant accepts image and text input, and the 31B model that Cerebras is hosting is a dense model Google positions around quality and efficiency rather than raw parameter count. It can read screenshots, documents, charts, forms, and diagrams and write text about them — its output is text, not images or video. Cerebras cites third-party measurements placing the model around the level of compact frontier models on general intelligence benchmarks while running far faster on its hardware.
The broader signal is that high-speed, low-cost inference is no longer limited to text-only or closed models. An open, commercially licensed multimodal model now runs at interactive speeds you can build real-time and high-volume workflows on. As always with preview launches, exact throughput, pricing, and availability will move, so treat the specific figures as a snapshot of the announcement.
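To make "interactive speeds" concrete, here is a back-of-the-envelope sketch using only the announced figures. It assumes a deliberately simple model (wait for the first token, then stream the rest at a constant rate), and the function and constant names are illustrative rather than part of any Cerebras API. The GPU baseline is only what the 35x framing implies, not a number quoted in the announcement.

```python
# Figures from the announcement; treat them as a snapshot, not a guarantee.
TOKENS_PER_SECOND = 1_800    # announced throughput ("over 1,800")
FIRST_TOKEN_SECONDS = 1.5    # announced time to first answer token, reasoning included
SPEEDUP_VS_GPU = 35          # Cerebras's own framing versus a typical GPU endpoint


def response_seconds(output_tokens: int,
                     tokens_per_second: float = TOKENS_PER_SECOND,
                     first_token_seconds: float = FIRST_TOKEN_SECONDS) -> float:
    """Rough wall-clock estimate: wait for the first token, then stream the rest.

    Assumes a constant streaming rate, which real endpoints only approximate.
    """
    return first_token_seconds + output_tokens / tokens_per_second


# Baseline implied by taking the 35x framing literally (an inference, not a quoted figure).
implied_gpu_tps = TOKENS_PER_SECOND / SPEEDUP_VS_GPU

for tokens in (300, 900, 1_800):
    total = response_seconds(tokens)
    gpu_stream = tokens / implied_gpu_tps  # streaming time only, first-token wait excluded
    print(f"{tokens:>5} tokens: ~{total:.2f}s total at the announced speed; "
          f"~{gpu_stream:.1f}s of streaming alone at the implied GPU rate")
```

Under these assumptions, the fixed first-token wait dominates short answers, so throughput matters most for long outputs and high-volume batch work.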
The temptation when a model gets this fast is to wire your content pipeline straight to whichever endpoint is quickest this month. That is a trap — the fastest endpoint changes constantly, and the speed of a model is not the speed of shipping content. Kompozy runs generation server-side on managed Claude and OpenAI models with the model layer abstracted away, so you never chase a leaderboard or re-wire when a new fast model lands. You write a Persona Brief and approve outputs; the engine keeps generating, scheduling, and publishing across all nine platforms regardless of what is fastest underneath.
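For readers who do run their own tooling, the idea of an abstracted model layer can be sketched in a few lines. This is a generic illustration of the pattern, not Kompozy's implementation: `TextModel`, `EchoModel`, and `ContentPipeline` are hypothetical names, and the stand-in backend returns canned text instead of calling a real endpoint.

```python
from dataclasses import dataclass
from typing import Protocol


class TextModel(Protocol):
    """Hypothetical interface: anything that turns a prompt into text."""

    def generate(self, prompt: str) -> str: ...


@dataclass
class EchoModel:
    """Stand-in backend for this sketch; a real one would call a hosted model."""

    name: str

    def generate(self, prompt: str) -> str:
        return f"[{self.name}] draft for: {prompt}"


@dataclass
class ContentPipeline:
    """Pipeline code depends only on TextModel, never on a specific provider."""

    model: TextModel

    def draft_post(self, brief: str) -> str:
        return self.model.generate(f"Write a post from this brief: {brief}")


pipeline = ContentPipeline(model=EchoModel("backend-a"))
first = pipeline.draft_post("fast open multimodal inference")

# Swapping the backend is a one-line change; draft_post itself is untouched.
pipeline.model = EchoModel("backend-b")
second = pipeline.draft_post("fast open multimodal inference")

print(first)
print(second)
```

The point of the pattern is that the calling code never names a provider, so a faster endpoint can be swapped in without touching the rest of the workflow.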
Where a model like Gemma 4 earns its keep is the front of the workflow: reading your raw inputs and drafting fast. Because the weights are open, you can run it on your own hardware if you want. Use it to turn a folder of screenshots, charts, or transcripts into briefs and hooks, then bring those into Kompozy, which does the parts no model does: rendering persona and avatar video, carousels, quote cards, and infographics, rewriting in your brand voice, captioning and reframing clips per platform, and publishing on a schedule. There is also a direct content play in the news itself, since fast open multimodal AI is exactly the kind of timely topic your audience is searching for this week. Drop your take into Kompozy and it fans one point of view into a blog post, a carousel explainer, short captioned clips, and platform-native posts, then ships them. Being early and clear on a story like this is how one take becomes a week of content.
Gemma 4 is Google DeepMind's open-weight, multimodal model family (Apache 2.0), which accepts image and text input and outputs text. Cerebras announced on June 29, 2026 that it serves the Gemma 4 31B model at over 1,800 tokens per second — fast multimodal inference on an open model, in public preview.
Gemma 4 does not generate images or video. It is multimodal on the input side: it can read images, and the family also handles audio and video frames, but its output is text. It understands and describes visual inputs; it does not render images, video, or audio.
Fast inference on a model like Gemma 4 makes the reading-and-drafting step cheaper and faster, but it does not produce finished posts. The model generates no video, carousels, or branded media, and it publishes nothing. To turn fast drafts into scheduled, on-brand content across platforms, you pair it with a content engine like Kompozy.
Use a fast model like Gemma 4 to read inputs and draft copy, then bring the drafts into Kompozy to generate persona/avatar video, carousels, quote cards, and images, rewrite in your brand voice via the Persona Brief, and schedule and publish across nine platforms — without wiring your pipeline to any single model endpoint.