Prompt-level instructions to avoid AI tells fail about one time in five. The brand-safety gate is the deterministic output-time filter that catches that 20%. This guide covers its architecture, the regex strategy, the regeneration loop, how it reads the banned-word list from your Persona Brief, and how to tune it without making your content sound stilted.
The brand-safety gate is a deterministic, regex-based filter that checks every autopilot output against the banned-word list in your Persona Brief at output time — after generation, not just in the prompt. It uses case-insensitive, word-boundary matching so "leverage" is caught but "Leverington Street" is not. Any match rejects the output and triggers regeneration with the offending phrase fed back into the prompt. After three failed regenerations, the output routes to manual review with the persistent phrase flagged. The gate exists because base models override prompt-level banned-word instructions roughly one time in five, and a prompt cannot catch what it failed to prevent.
Telling a model "never use the word leverage" works about eighty percent of the time. The other twenty percent is the entire problem with prompt-only brand safety: the one output in five that slips a banned phrase through is the one your audience reads, and on autopilot there is no human approval step waiting to catch it. A prompt is an instruction the model can ignore. A gate is a check the model cannot route around.
The brand-safety gate is the fourth of the four autopilot quality gates, and it is the one that keeps your content from quietly drifting back into generic AI voice. It reads the banned-word list straight out of your Persona Brief, compiles it into a deterministic regex set, and runs every generated output through that set before anything ships. A match means rejection and regeneration — not a warning, not a soft flag, a hard pass-or-fail.
This spoke is the architecture and tuning guide for that gate: why output-time checking beats prompt-time instruction, exactly how the regex matching avoids false positives, what belongs in the banned-word list and what does not, the regeneration prompt engineering that makes the rewrite work, industry-specific banned-word patterns, and the monthly tuning loop that keeps the list sharp without over-banning your own voice into mush. It pairs with the [fact-anchor gate](/autonomous/fact-anchor-gate), which runs immediately before it, and the broader [quality-gates](/autonomous/quality-gates) overview.
The intuitive fix for AI tells is to list your banned words in the generation prompt and trust the model to obey. It mostly works, which is exactly what makes it dangerous: "mostly" is not a standard you can ship autonomous content against. Three structural forces cause base models to violate prompt-level banned-word instructions, and none of them can be fixed by writing the instruction more forcefully: training-data bias, context drift, and the synonym dodge, where the model honors the literal ban but swaps in an equally bad synonym.
A deterministic output-time check defeats all three at once, because it does not depend on the model behaving correctly. It reads the finished output and asks a binary question: does this text contain a banned phrase, yes or no? Training-data bias, context drift, and the literal-ban-but-bad-synonym dodge are all invisible to a prompt and all caught by a regex run over the result. The gate works precisely because it stops trusting the model and starts checking it.
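The gate is intentionally boring at the mechanism level, and boring is what makes it deterministic. There is no second model judging the first, no probabilistic classifier with a confidence score to tune. It is a compiled regex set and a rejection loop. The sequence: compile the Persona Brief banned-word list into a regex set; check each output once it has cleared the fact-anchor gate; reject on any match; regenerate with the offending phrase fed back into the prompt; re-check the rewrite against the full list; and after three failed regenerations, route the output to manual review with the persistent phrase flagged.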
Two design choices in that loop are worth calling out. First, the gate runs after the fact-anchor gate, not before — fact-anchor failures trigger regeneration that produces different text, so checking brand safety first would waste compute on output that was about to change anyway. Second, the three-attempt cap exists because a banned phrase the model keeps reaching for usually signals a deeper conflict (often a Persona Brief that bans a word it also requires elsewhere), and an uncapped loop would spin forever instead of surfacing that conflict to a human.
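The loop described above can be sketched in a few lines. This is a minimal illustration, not the product's implementation: `compile_patterns`, `banned_hits`, `GateResult`, `run_brand_safety_gate`, and the `regenerate` callback are hypothetical names, and the matching here is plain word-boundary matching with no inflection handling.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List

MAX_REGENERATIONS = 3  # the three-attempt cap described in the text


def compile_patterns(phrases: List[str]) -> Dict[str, "re.Pattern"]:
    """One case-insensitive, word-boundary pattern per banned phrase."""
    return {p: re.compile(r"\b" + re.escape(p) + r"\b", re.IGNORECASE) for p in phrases}


def banned_hits(text: str, patterns: Dict[str, "re.Pattern"]) -> List[str]:
    """Return every banned phrase found in the text (empty list means clean)."""
    return [phrase for phrase, pat in patterns.items() if pat.search(text)]


@dataclass
class GateResult:
    status: str          # "passed" or "manual_review"
    text: str            # the last text that was checked
    regenerations: int   # how many rewrites were requested
    flagged: List[str] = field(default_factory=list)  # phrases still present


def run_brand_safety_gate(
    draft: str,
    patterns: Dict[str, "re.Pattern"],
    regenerate: Callable[[str, List[str]], str],  # hypothetical model call
) -> GateResult:
    text = draft
    hits = banned_hits(text, patterns)
    regenerations = 0
    while hits and regenerations < MAX_REGENERATIONS:
        text = regenerate(text, hits)       # offending phrases are fed back in
        regenerations += 1
        hits = banned_hits(text, patterns)  # re-check against the full list
    status = "passed" if not hits else "manual_review"
    return GateResult(status, text, regenerations, hits)
```

The cap is a constant rather than a parameter on purpose: an output that still fails after the last regeneration comes back with `status == "manual_review"` and the persistent phrase in `flagged`, so the conflict reaches a human instead of spinning.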
Because the check is a regex pass and not a model call, the latency it adds is negligible — well under a tenth of a second per output on a typical list. The fact-anchor gate before it is the slow one; brand-safety is effectively free. That cost profile is why it can run on every single output without anyone thinking about it.
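The latency claim is easy to check against your own list. The harness below is a rough sketch under stated assumptions: the 200 placeholder phrases and the synthetic post stand in for a real banned-word list and a real output, and `average_check_seconds` is a hypothetical helper name. It scans clean text, which is the worst case because the regex has to read the whole output.

```python
import re
import time


def average_check_seconds(phrases, text, runs=20):
    """Average wall-clock time of one regex pass over `text`."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    start = time.perf_counter()
    for _ in range(runs):
        pattern.search(text)  # clean text forces a full scan
    return (time.perf_counter() - start) / runs


# Placeholder data: swap in your real list and a real output to measure.
phrases = [f"placeholder phrase {i}" for i in range(200)]
post = "A clean sentence with nothing banned in it. " * 100
elapsed = average_check_seconds(phrases, post)
print(f"{elapsed * 1000:.3f} ms per check")
```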
The brand-safety gate does not invent its own rules. Every phrase it enforces comes from the banned-word list inside your Persona Brief — the same brief that gates generation (Gate 1) and supplies the voice DNA every output is written against. This is the load-bearing connection: the banned-word section of the Persona Brief is not a nice-to-have field, it is the literal configuration that drives the strongest output-time gate in the autopilot stack. A thin banned-word list means a weak brand-safety gate, full stop.
This is also why the banned-word section tends to move output quality more than any other part of the Persona Brief. Voice DNA and reference posts nudge the model toward your style probabilistically; the banned-word list enforces specific exclusions deterministically. The phrases you hate — the AI tells, the industry cliches, the competitor names — get removed every time, not most of the time. Writing this section well is the single highest-leverage thing you can do to make autopilot output sound like you instead of like a language model.
The list should contain three categories of phrase, and a mature Persona Brief draws from all three:
| Banned-list category | Where it comes from | Example entries | How it changes over time |
|---|---|---|---|
| Universal AI tells | Standard AI-tells library, shared across all workspaces | leverage, delve, unlock, in today's fast-paced world, a testament to | Rotates slowly as models change and new tics emerge |
| Industry cliches | Your field's overused jargon | motivated seller (real estate), synergize (SaaS), mindset shift (coaching) | Revised most often, because field cliches age out fastest |
| Brand-specific bans | Your workspace only — no library has these | competitor names, regulated phrases, internal jargon, legal-flagged claims | Grows as legal and brand decisions accumulate |
A starter list of fifty to eighty phrases from the universal library plus twenty to thirty industry additions is enough to flip autopilot on. The list grows from there through the monthly tuning loop, typically settling around 150-250 phrases after six months of refinement. The point is not to ship the perfect list on day one — it is to ship a good-enough list and let the editing you do during the [manual ramp](/autonomous/manual-vs-autopilot-ramp) reveal the rest.
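One way to picture the three categories is as separate fields that get merged into a single list before the gate compiles it. The structure below is an assumption for illustration only: the `BannedWordList` class, its field names, and the `ExampleCompetitor` entry are hypothetical, and the Persona Brief may store the list differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BannedWordList:
    universal_ai_tells: List[str] = field(default_factory=list)
    industry_cliches: List[str] = field(default_factory=list)
    brand_specific: List[str] = field(default_factory=list)

    def merged(self) -> List[str]:
        """Combine all three categories, lowercased and de-duplicated in order."""
        seen = set()
        result = []
        combined = self.universal_ai_tells + self.industry_cliches + self.brand_specific
        for phrase in combined:
            key = " ".join(phrase.lower().split())  # normalize case and spacing
            if key and key not in seen:
                seen.add(key)
                result.append(key)
        return result


brief_bans = BannedWordList(
    universal_ai_tells=["leverage", "delve", "unlock",
                        "in today's fast-paced world", "a testament to"],
    industry_cliches=["motivated seller", "Leverage"],  # duplicate on purpose
    brand_specific=["ExampleCompetitor"],               # placeholder name
)
print(brief_bans.merged())
```

Keeping the categories separate until the merge makes the monthly review easier, since each category changes at a different pace.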
The single most common way a banned-word filter goes wrong is substring matching. Ban "tech" with a naive substring match and you also ban "technology," "fintech," and "biotech": the gate rejects clean output over a fragment, the regeneration loop burns three attempts, and the output lands in review for no reason. The brand-safety gate avoids this with word-boundary regex, which matches whole words and phrases rather than any sequence of characters that happens to contain the banned string.
Word-boundary matching catches the listed word or phrase in any capitalization, its common inflections, and its hyphen-or-space variants. It skips embedded fragments, reversed word order, and the component words of a phrase on their own.
Matching is case-insensitive, which closes the obvious dodge: a model cannot sidestep the ban by capitalizing. "Leverage," "leverage," and "LEVERAGE" all match the same entry. The combination — case-insensitive plus word-boundary — is what makes the gate both thorough (no capitalization escape, all inflections caught) and precise (no substring false positives). When the gate misfires, it is almost always because a phrase was listed that is too short or too generic, not because the matching strategy failed. The fix is to lengthen the phrase or ban the specific construction rather than the fragment.
| Banned list entry | Matches | Does NOT match | Why |
|---|---|---|---|
| leverage | leverage, leveraging, leveraged | Levered, Leverington | Word-boundary catches inflections, skips embedded fragments |
| dive deep | dive deep, diving deep | deep dive, deepest dive | Phrase matched as ordered unit; reverse order needs its own entry |
| unlock | unlock, unlocking, unlocked | unlocker (rare), padlock | Inflections caught at boundary; unrelated words skipped |
| tech (too short; avoid) | tech, Tech, TECH | technology, fintech, biotech | Word-boundary saves it, but a 4-letter ban is high-risk; prefer a longer phrase |
| game-changer | game-changer, game changer | changer, game | Hyphen and space variants both caught; component words alone are safe |
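
The rows above can be reproduced with a small amount of Python. A plain `\b...\b` pattern would not match "leveraging" or "unlocked" on its own, so this sketch assumes the gate expands each word with a short set of inflection suffixes and treats hyphens and spaces as interchangeable. The suffix rule and the helper names (`word_pattern`, `compile_entry`, `find_banned`) are illustrative assumptions, not the gate's confirmed internals.

```python
import re
from typing import Dict, List


def word_pattern(word: str) -> str:
    """Regex fragment for one word plus a crude set of inflections."""
    if len(word) < 4 or not word.isalpha():
        return re.escape(word)  # leave short or punctuated words literal
    if word.lower().endswith("e"):
        # leverage -> leverage, leverages, leveraged, leveraging
        return re.escape(word[:-1]) + "(?:e|es|ed|ing)"
    # unlock -> unlock, unlocks, unlocked, unlocking
    return re.escape(word) + "(?:s|es|ed|ing)?"


def compile_entry(entry: str) -> "re.Pattern":
    """Case-insensitive, word-boundary pattern for one banned-list entry."""
    words = [w for w in re.split(r"[\s-]+", entry.strip()) if w]
    body = r"[\s-]+".join(word_pattern(w) for w in words)  # hyphen or space
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)


def compile_banned_list(entries: List[str]) -> Dict[str, "re.Pattern"]:
    return {entry: compile_entry(entry) for entry in entries}


def find_banned(text: str, compiled: Dict[str, "re.Pattern"]) -> List[str]:
    """Return the list entries that match somewhere in the text."""
    return [entry for entry, pat in compiled.items() if pat.search(text)]


compiled = compile_banned_list(
    ["leverage", "dive deep", "unlock", "tech", "game-changer"]
)
print(find_banned("We are LEVERAGING a real game changer.", compiled))
print(find_banned("Meet me on Leverington Street for a deep dive.", compiled))
```

Compiling one pattern per entry, rather than one large alternation, keeps it trivial to report which entry fired. The suffix rule is deliberately small: it does not cover irregular forms, and any entry that needs them would have to be listed explicitly.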
When the gate rejects an output, the quality of the rewrite depends entirely on what the regeneration prompt tells the model. A generic "do not use leverage" is weak — it is the same prompt-level instruction that already failed, with no extra information. The gate does better by handing the model three things: the exact phrase that was caught, the meaning of the sentence it appeared in, and a direction to replace it with concrete language. A regeneration prompt that works looks like this:
“Your previous output contained the banned phrase "leverage." Regenerate the post without that phrase or any synonym for it. The original sentence meant: [paraphrase of the surrounding sentence]. Replace it with concrete, specific language rather than another abstract verb.”
Three elements make this rewrite reliably better than a bare instruction. Naming the exact phrase removes ambiguity about what failed. Supplying the original meaning prevents the model from dropping the idea entirely or mangling the sentence to route around one word. And the "concrete, specific language" direction steers the model away from the synonym-substitution trap — if you do not name a target, the model will often swap one AI tell for another. The regeneration prompt is doing real engineering work; it is not just a retry.
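Assembling that prompt is mechanical once the offending phrase is known. The sketch below fills the template quoted above; `sentence_containing` and `build_regeneration_prompt` are hypothetical helper names, the sentence splitter is deliberately naive, and the paraphrase of the original meaning is assumed to come from the caller.

```python
import re


def sentence_containing(text: str, phrase: str) -> str:
    """Return the first sentence that contains the banned phrase, or ''."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    # Naive split on sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if pattern.search(sentence):
            return sentence
    return ""


def build_regeneration_prompt(banned_phrase: str, original_meaning: str) -> str:
    """Fill the regeneration template with the phrase and the intended meaning."""
    meaning = original_meaning.strip().rstrip(".")
    return (
        f'Your previous output contained the banned phrase "{banned_phrase}." '
        "Regenerate the post without that phrase or any synonym for it. "
        f"The original sentence meant: {meaning}. "
        "Replace it with concrete, specific language rather than another abstract verb."
    )


draft = "We ship weekly. We leverage data to plan. Done."
print(sentence_containing(draft, "leverage"))
print(build_regeneration_prompt("leverage", "we use our data when planning"))
```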
There is one failure case the regeneration prompt cannot fix on its own: a banned phrase that the Persona Brief also requires in its "required structures" section. The model is told to use a phrase and forbidden from using it at the same time, so every regeneration fails and the output exhausts its three attempts. The gate handles this by routing to review rather than looping forever, but the real fix is upstream — audit the brief for bans that conflict with requirements before they cost you three regenerations per output.
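That upstream audit can be automated with the same matching the gate uses. This is a minimal sketch: `find_brief_conflicts` is a hypothetical name, and it assumes the brief's banned list and required structures are both available as plain lists of strings.

```python
import re
from typing import List, Tuple


def find_brief_conflicts(
    banned: List[str], required_structures: List[str]
) -> List[Tuple[str, str]]:
    """Pair each banned phrase with any required structure that contains it."""
    conflicts = []
    for phrase in banned:
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        for requirement in required_structures:
            if pattern.search(requirement):
                conflicts.append((phrase, requirement))
    return conflicts


banned = ["leverage", "game changer"]
required = [
    "Open with a concrete number",
    "Close by asking how readers leverage their data",
]
print(find_brief_conflicts(banned, required))
```

Running a check like this whenever the brief changes surfaces the contradiction before any output spends its three regenerations on it.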
The universal AI-tells library is the same for everyone, but the industry layer is where most workspaces win or lose on voice. The cliches that mark content as generic are field-specific, and a model trained on the internet will reach for your industry's most overused phrases unless you ban them explicitly. Starting points by vertical — extend each with the phrases you personally keep cutting:
Two cautions on the industry layer. First, some of these phrases are only bad when overused — "passive income" is a legitimate real-estate term and a cliche when it shows up in every post. Ban the ones you are tired of seeing, not the entire vocabulary of your field. Second, the industry list is the part you should expect to revise most often, because field cliches rotate; the phrase everyone overused last year reads as dated this year, and a fresh AI tell takes its place. The monthly tuning loop is where that revision happens.
The brand-safety gate is deterministic about exactly one thing: the presence of a listed phrase. That makes it excellent at its job and useless outside it, and being honest about the boundary keeps you from over-trusting it. What it reliably catches:
What it cannot catch, because none of these is a listed-phrase problem:
The brand-safety gate is necessary, not sufficient. Paired with the [fact-anchor gate](/autonomous/fact-anchor-gate) ahead of it, the two deterministic gates catch the large majority of bad outputs before publishing — invented stats and fabricated quotes on the fact-anchor side, AI tells and banned phrases on the brand-safety side. What remains is the judgment layer: tone, framing, strategic fit. That is why even fully autonomous workspaces review aggregate metrics weekly. The gates handle the deterministic failures; a human handles the judgment ones.
Most brand-safety gate problems are configuration problems, not gate problems. The four that come up most often, and the fix for each:
The banned-word list is not a set-and-forget configuration. It is a living document that should converge over the first few months and then rotate slowly forever, because AI tells and industry cliches both drift. A monthly audit keeps it sharp without letting it bloat into the over-banning failure mode. The checklist:
One tuning principle overrides the rest: ban with a replacement in mind, not just a removal. Output sounds stilted when you strip words without giving the model an alternative direction — that is the over-banning failure in a different guise. The right move is to ban the phrase and let the regeneration prompt steer toward concrete language, so the result reads cleaner rather than emptier. A well-tuned brand-safety gate makes content sound more like you; a carelessly tuned one makes it sound like someone reading with a thesaurus held to their head. The difference is whether the bans came with replacements.
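Part of the monthly audit can be mechanical. Since short single-word bans are the riskiest entries and a growing list tends to collect duplicates, a small lint can surface both before a human reviews the rest. The helper name `audit_banned_list` and the five-character threshold are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, List


def audit_banned_list(
    entries: List[str], min_single_word_len: int = 5
) -> Dict[str, List[str]]:
    """Flag short single-word bans and duplicate entries for review."""
    normalized = [" ".join(e.lower().split()) for e in entries]
    counts = Counter(normalized)
    short = sorted(
        {
            e
            for e in normalized
            if " " not in e and "-" not in e and 0 < len(e) < min_single_word_len
        }
    )
    duplicates = sorted(e for e, n in counts.items() if n > 1)
    return {"short_single_words": short, "duplicates": duplicates}


report = audit_banned_list(
    ["tech", "Leverage", "leverage", "dive deep", "game-changer"]
)
print(report)
```

A flagged entry is a prompt for judgment, not an automatic deletion: the usual fix is to lengthen the phrase or ban the specific construction rather than the fragment.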
For where this gate sits in the full ordering, and how it hands off from the fact-anchor gate before it, see the [quality-gates](/autonomous/quality-gates) architecture overview. For how the banned-word list gets built in the first place — through the editing you do while still in the approval loop — see the [manual-vs-autopilot-ramp](/autonomous/manual-vs-autopilot-ramp) methodology. And for what the gate protects when you are fanning one source out across platforms, see [content repurposing](/repurpose). Kompozy ships the four-gate stack on its Creator and Pro tiers; see [pricing](/pricing).
Because prompt instructions are probabilistic and the gate is deterministic. A model overrides prompt-level banned-word rules about one time in five — through training-data bias, context drift, or swapping in an equally bad synonym. On a manual workflow a human catches that fifth case; on autopilot there is no human in the per-output loop, so an output-time regex check is the only thing that reliably catches what the prompt failed to prevent.
From the banned-word section of your Persona Brief — the same brief that supplies your voice DNA and reference posts. The gate compiles that list into a regex set and enforces it on every output. This is why the banned-word section is the highest-leverage field in the brief: a thin list means a weak gate, and a well-built list is what makes autopilot output sound like you instead of like a language model.
Only if you ban without replacing. Stilted output comes from stripping words and leaving a gap. The gate avoids this by feeding the regeneration prompt the original meaning and a direction toward concrete language, so the rewrite reads cleaner rather than emptier. Ban the phrase and steer the replacement, and a long list improves voice instead of flattening it.
A starter list of 50-80 phrases from the universal AI-tells library plus 20-30 industry-specific additions is enough to enable autopilot. Mature lists settle around 150-250 phrases after about six months of monthly tuning. The exact number matters less than the source: the phrases you keep editing out of real outputs are the ones that belong on the list.
It uses word-boundary matching, so it catches whole words and phrases rather than any character sequence containing the banned string. "leverage" matches "leveraging" but not "Leverington"; "dive deep" matches "diving deep" but not "deep dive" (reverse order needs its own entry). Matching is also case-insensitive so capitalization cannot dodge it. Short single-word bans are still the riskiest entries — prefer specific multi-word phrases where you can.
It regenerates up to three times, each attempt re-checked against the full list because a rewrite can introduce a different banned phrase than it removed. If it still contains a banned phrase after three attempts, it routes to manual review with the persistent phrase highlighted rather than looping forever or shipping dirty. A phrase that survives three regenerations usually signals a Persona Brief conflict — a word banned in one section and required in another — which is worth auditing directly.
Barely. It is a regex pass over the finished text, not a model call, so it adds well under a tenth of a second per output on a typical list. The fact-anchor gate that runs just before it — which matches claims against source material — is the slow gate; brand-safety is effectively free and can run on every output without anyone noticing the latency.
Some implementations do for safety-critical bans like hate speech or harmful content, where a moderation classifier adds value. For brand-style bans (AI tells, industry cliches, competitor names) regex is faster, fully controllable, and deterministic, which is exactly what you want for workspace-specific rules. The common pattern is regex for the brand-style list and a moderation API layered on only for the safety-critical category. Verify any specific moderation API before relying on it for compliance-grade filtering.