Gemma 4 Runs at 1,800+ Tokens per Second on Cerebras — Fast Multimodal Inference Goes Open

Cerebras put Google's open Gemma 4 31B on its inference cloud at over 1,800 tokens per second, bringing image-and-text understanding to near-instant speeds.

2026-06-30 · by Moe Ameen

What happened

Cerebras announced on June 29, 2026 that it is serving Google DeepMind's open-weight Gemma 4 31B model on its inference cloud at over 1,800 tokens per second, a speed it frames as roughly 35 times that of a typical GPU endpoint. The first answer token, reasoning included, returns in about 1.5 seconds. The release is in public preview on the Cerebras Inference Cloud. The notable part is not just the speed but the modality: Gemma 4 is multimodal on input, so this is fast inference over images as well as text.
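
To put the announced figures in perspective, here is a rough back-of-the-envelope sketch in Python. It treats total response time as time to first token plus output length divided by throughput. The 900-token response length is a hypothetical example, the "typical GPU endpoint" rate is simply derived from the 35x framing, and applying the same 1.5-second first-token delay to both endpoints is an assumption made only to keep the comparison simple.

```python
def estimated_response_seconds(output_tokens: int,
                               tokens_per_second: float,
                               first_token_seconds: float) -> float:
    """Rough end-to-end time: wait for the first token, then stream the rest."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return first_token_seconds + output_tokens / tokens_per_second


# Figures quoted in the announcement (a snapshot, not a guarantee).
CEREBRAS_TPS = 1800.0        # "over 1,800 tokens per second"
SPEEDUP_VS_GPU = 35.0        # "roughly 35 times a typical GPU endpoint"
FIRST_TOKEN_SECONDS = 1.5    # "about 1.5 seconds"

# Derived, not stated in the announcement: about 51 tokens per second.
IMPLIED_GPU_TPS = CEREBRAS_TPS / SPEEDUP_VS_GPU

# Hypothetical response length chosen for illustration.
OUTPUT_TOKENS = 900

fast = estimated_response_seconds(OUTPUT_TOKENS, CEREBRAS_TPS, FIRST_TOKEN_SECONDS)
# Assumption: the slower endpoint has the same first-token delay.
slow = estimated_response_seconds(OUTPUT_TOKENS, IMPLIED_GPU_TPS, FIRST_TOKEN_SECONDS)

print(f"At the announced rate: {fast:.1f} s")     # 2.0 s
print(f"At the implied GPU rate: {slow:.1f} s")   # about 19.0 s
```

Under these assumptions a 900-token answer takes about 2 seconds instead of about 19, which is the gap between a tool that feels interactive and one that does not.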

Gemma 4 is Google DeepMind's open-weight model family, released under the Apache 2.0 license earlier in 2026. Every variant accepts image and text input, and the 31B model that Cerebras is hosting is a dense model Google positions around quality and efficiency rather than raw parameter count. It can read screenshots, documents, charts, forms, and diagrams and write text about them — its output is text, not images or video. Cerebras cites third-party measurements placing the model around the level of compact frontier models on general intelligence benchmarks while running far faster on its hardware.

The broader signal is that high-speed, low-cost inference is no longer limited to text-only or closed models. An open, commercially licensed multimodal model now runs at interactive speeds, fast enough to build real-time and high-volume workflows on. As always with preview launches, exact throughput, pricing, and availability will move, so treat the specific figures as a snapshot of the announcement.

Why it matters for creators

  • Multimodal understanding at interactive speed means you can analyze images — competitor posts, screenshots, charts, product photos — in bulk and near-instantly, not one slow call at a time.
  • Gemma 4 is open under Apache 2.0, so the model behind these speeds can be self-hosted or fine-tuned, not just rented through one vendor.
  • Fast, cheap inference lowers the cost of the "reading and drafting" half of a content workflow — turning raw inputs into briefs and rough copy at volume.
  • It is still a text-output model: it reads images but generates none, and it publishes nothing. Model speed is not the same as speed to a finished, scheduled post.
  • When open multimodal models reach frontier-adjacent quality at a fraction of the cost, the build-vs-buy math shifts for anyone considering their own pipeline.

How to act on this with Kompozy

The temptation when a model gets this fast is to wire your content pipeline straight to whichever endpoint is quickest this month. That is a trap — the fastest endpoint changes constantly, and the speed of a model is not the speed of shipping content. Kompozy runs generation server-side on managed Claude and OpenAI models with the model layer abstracted away, so you never chase a leaderboard or re-wire when a new fast model lands. You write a Persona Brief and approve outputs; the engine keeps generating, scheduling, and publishing across all nine platforms regardless of what is fastest underneath.
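
As a generic illustration of what an abstracted model layer means, here is a minimal Python sketch. It is not Kompozy's implementation: the `TextModel` interface, the two stand-in backends, and `draft_hook` are all hypothetical names invented for this example, and the backends only echo their prompt instead of calling a real endpoint.

```python
from typing import Protocol


class TextModel(Protocol):
    """Hypothetical interface: anything that turns a prompt into text."""

    def generate(self, prompt: str) -> str: ...


class StandInModelA:
    """Stand-in backend; a real one would call a hosted endpoint."""

    def generate(self, prompt: str) -> str:
        return f"[model-a] {prompt}"


class StandInModelB:
    """A second stand-in backend with the same interface."""

    def generate(self, prompt: str) -> str:
        return f"[model-b] {prompt}"


def draft_hook(model: TextModel, topic: str) -> str:
    """Pipeline step that depends only on the interface, not on a vendor."""
    return model.generate(f"Write a one-line hook about: {topic}")


topic = "fast open multimodal inference"
hook_a = draft_hook(StandInModelA(), topic)
hook_b = draft_hook(StandInModelB(), topic)  # swapped backend, same pipeline code

print(hook_a)
print(hook_b)
```

The point of the design is that `draft_hook` never changes when the backend does, so a faster model can be swapped in without rewiring the rest of the workflow.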

Where a model like Gemma 4 earns its keep is the front of the workflow: reading your raw inputs and drafting fast. Because it is open, you can run it on your own hardware if you want. Use it to turn a folder of screenshots, charts, or transcripts into briefs and hooks, then bring those into Kompozy, which does the parts no model does: rendering persona and avatar video, carousels, quote cards, and infographics; rewriting in your brand voice; captioning and reframing clips per platform; and publishing on a schedule. There is also a direct content play in the news itself, since fast open multimodal AI is exactly the kind of timely topic your audience is searching for this week. Drop your take into Kompozy and it fans one point of view into a blog post, a carousel explainer, short captioned clips, and platform-native posts, then ships them. Being early and clear on a story like this is how one take becomes a week of content.

Quick takeaways

  • Cerebras is serving Google's open Gemma 4 31B at 1,800+ tokens per second in public preview, with first token in about 1.5 seconds.
  • Gemma 4 is multimodal on input (image + text) and open under Apache 2.0; its output is text, not images or video.
  • Fast, cheap, open multimodal inference makes high-volume image-and-document reading practical for content workflows.
  • Model speed is not the same as a finished post — Kompozy abstracts the model layer and turns drafts into 18 formats published across nine platforms.

Frequently asked questions

What is Gemma 4 and what makes this Cerebras news notable?

Gemma 4 is Google DeepMind's open-weight, multimodal model family (Apache 2.0), which accepts image and text input and outputs text. Cerebras announced on June 29, 2026 that it serves the Gemma 4 31B model at over 1,800 tokens per second — fast multimodal inference on an open model, in public preview.

Can Gemma 4 generate images or video?

No. Gemma 4 is multimodal on the input side — it can read images, and the family also handles audio and video frames — but its output is text. It understands and describes visual inputs; it does not render images, video, or audio.

Does fast inference like this change my content workflow?

It makes the reading-and-drafting step cheaper and faster, but it does not produce finished posts. The model still generates no video, carousels, or branded media, and it publishes nothing. To turn fast drafts into scheduled, on-brand content across platforms, you pair it with a content engine like Kompozy.

How should creators use this with Kompozy?

Use a fast model like Gemma 4 to read inputs and draft copy, then bring the drafts into Kompozy to generate persona/avatar video, carousels, quote cards, and images, rewrite in your brand voice via the Persona Brief, and schedule and publish across nine platforms — without wiring your pipeline to any single model endpoint.
