The brand-safety gate: output-time banned-word filtering for autonomous AI content

Prompt-level instructions to avoid AI tells fail about one time in five. The brand-safety gate is the deterministic output-time filter that catches that 20% — the architecture, the regex strategy, the regeneration loop, how it reads from your Persona Brief banned-word list, and how to tune it without making your content sound stilted.

Last verified · 2026-06-18 · by Moe Ameen

The direct answer

The brand-safety gate is a deterministic, regex-based filter that checks every autopilot output against the banned-word list in your Persona Brief at output time — after generation, not just in the prompt. It uses case-insensitive, word-boundary matching so "leverage" is caught but "Leverington Street" is not. Any match rejects the output and triggers regeneration with the offending phrase fed back into the prompt. After three failed regenerations, the output routes to manual review with the persistent phrase flagged. The gate exists because base models override prompt-level banned-word instructions roughly one time in five, and a prompt cannot catch what it failed to prevent.

Telling a model "never use the word leverage" works about eighty percent of the time. The other twenty percent is the entire problem with prompt-only brand safety: the one output in five that slips a banned phrase through is the one your audience reads, and on autopilot there is no human approval step waiting to catch it. A prompt is an instruction the model can ignore. A gate is a check the model cannot route around.

The brand-safety gate is the fourth of the four autopilot quality gates, and it is the one that keeps your content from quietly drifting back into generic AI voice. It reads the banned-word list straight out of your Persona Brief, compiles it into a deterministic regex set, and runs every generated output through that set before anything ships. A match means rejection and regeneration — not a warning, not a soft flag, a hard pass-or-fail.

This spoke is the architecture and tuning guide for that gate: why output-time checking beats prompt-time instruction, exactly how the regex matching avoids false positives, what belongs in the banned-word list and what does not, the regeneration prompt engineering that makes the rewrite work, industry-specific banned-word patterns, and the monthly tuning loop that keeps the list sharp without over-banning your own voice into mush. It pairs with the [fact-anchor gate](/autonomous/fact-anchor-gate), which runs immediately before it, and the broader [quality-gates](/autonomous/quality-gates) overview.

Why prompt-level instructions are not enough

The intuitive fix for AI tells is to list your banned words in the generation prompt and trust the model to obey. It mostly works, which is exactly what makes it dangerous — "mostly" is not a standard you can ship autonomous content against. Three structural forces cause base models to violate prompt-level banned-word instructions, and none of them are fixable by writing the instruction more forcefully:

Training-data bias. The model has seen "leverage," "delve," and "in today's fast-paced world" in millions of marketing documents. The statistical pull toward the phrases that dominate its training data competes with your specific instruction, and on enough outputs the pull wins.
Context drift. As the prompt context grows — Persona Brief, source material, format instructions, reference posts — a single "do not use X" line gets diluted relative to the total weight of everything else in context. The longer the prompt, the weaker any one rule.
Synonym substitution. Even when the model obeys the literal ban, it sometimes swaps in a near-synonym that is just as bad. Ban "dive deep" and it writes "explore in depth." Ban "unlock" and it writes "unleash." The instruction was followed; the AI tell shipped anyway.

A deterministic output-time check addresses all three, because it does not depend on the model behaving correctly. It reads the finished output and asks a binary question: does this text contain a banned phrase, yes or no? A slip caused by training-data bias or context drift is invisible to the prompt but caught by a regex run over the result, and a bad synonym is caught the same way once that synonym is itself on the list. The gate works precisely because it stops trusting the model and starts checking it.

How the brand-safety gate works

The gate is intentionally boring at the mechanism level — boring is what makes it deterministic. There is no second model judging the first, no probabilistic classifier with a confidence score to tune. It is a compiled regex set and a rejection loop. The sequence:

The banned-word list from your Persona Brief is parsed into a regex set — case-insensitive, with word boundaries on each phrase to avoid partial-match false positives.
After generation (and after the fact-anchor gate has cleared the output), every output runs through the full regex set.
Any match flags the output for regeneration. The specific offending phrase is captured and injected into the regeneration prompt so the model knows exactly what to remove and what the surrounding sentence meant.
Regeneration runs up to three times. Each attempt is re-checked against the full set, because a rewrite can introduce a different banned phrase than the one it removed.
If the output still contains a banned phrase after three attempts, it routes to manual review with the persistent phrase highlighted, instead of shipping or looping forever.
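
The sequence above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the product's actual implementation: `GateResult`, `first_banned_match`, `run_brand_safety_gate`, and the `regenerate` callback (a stand-in for the model call) are all hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Pattern


@dataclass
class GateResult:
    text: str                      # last text the gate checked
    passed: bool                   # True if the text cleared the gate
    attempts: int                  # regenerations used (0 to max_regenerations)
    flagged_phrase: Optional[str]  # persistent phrase when routed to review


def first_banned_match(text: str, patterns: List[Pattern[str]]) -> Optional[str]:
    """Return the first banned phrase found in text, or None if it is clean."""
    for pattern in patterns:
        match = pattern.search(text)
        if match:
            return match.group(0)
    return None


def run_brand_safety_gate(
    draft: str,
    patterns: List[Pattern[str]],
    regenerate: Callable[[str, str], str],  # hypothetical: (text, offending phrase) -> new text
    max_regenerations: int = 3,
) -> GateResult:
    text = draft
    attempts = 0
    while True:
        # Every attempt is re-checked against the full set, not just the last offender.
        offending = first_banned_match(text, patterns)
        if offending is None:
            return GateResult(text, True, attempts, None)
        if attempts == max_regenerations:
            # Cap reached: route to manual review with the persistent phrase flagged.
            return GateResult(text, False, attempts, offending)
        text = regenerate(text, offending)
        attempts += 1


# Usage with a stand-in model that never fixes anything (hypothetical).
patterns = [
    re.compile(r"\bleverage\b", re.IGNORECASE),
    re.compile(r"\bunlock\b", re.IGNORECASE),
]


def stubborn_model(text: str, phrase: str) -> str:
    return text


result = run_brand_safety_gate("We leverage AI.", patterns, stubborn_model)
print(result.passed, result.attempts, result.flagged_phrase)  # False 3 leverage
```

Passing the model call in as a callback keeps the gate itself free of any model dependency, which is what lets the check stay deterministic.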

Two design choices in that loop are worth calling out. First, the gate runs after the fact-anchor gate, not before — fact-anchor failures trigger regeneration that produces different text, so checking brand safety first would waste compute on output that was about to change anyway. Second, the three-attempt cap exists because a banned phrase the model keeps reaching for usually signals a deeper conflict (often a Persona Brief that bans a word it also requires elsewhere), and an uncapped loop would spin forever instead of surfacing that conflict to a human.

Because the check is a regex pass and not a model call, the latency it adds is negligible — well under a tenth of a second per output on a typical list. The fact-anchor gate before it is the slow one; brand-safety is effectively free. That cost profile is why it can run on every single output without anyone thinking about it.

The Persona Brief is the source of truth for the banned list

The brand-safety gate does not invent its own rules. Every phrase it enforces comes from the banned-word list inside your Persona Brief — the same brief that gates generation (Gate 1) and supplies the voice DNA every output is written against. This is the load-bearing connection: the banned-word section of the Persona Brief is not a nice-to-have field, it is the literal configuration that drives the strongest output-time gate in the autopilot stack. A thin banned-word list means a weak brand-safety gate, full stop.

This is also why the banned-word section tends to move output quality more than any other part of the Persona Brief. Voice DNA and reference posts nudge the model toward your style probabilistically; the banned-word list enforces specific exclusions deterministically. The phrases you hate — the AI tells, the industry cliches, the competitor names — get removed every time, not most of the time. Writing this section well is the single highest-leverage thing you can do to make autopilot output sound like you instead of like a language model.

The list should contain three categories of phrase, and a mature Persona Brief draws from all three:

Universal AI tells — the cross-industry phrases that mark text as machine-written: leverage, delve, unlock, navigate the complexities of, in today's fast-paced world, a testament to, and the long tail of LLM verbal tics. Start from a standard AI-tells library and add as you find them.
Industry cliches — the overused jargon specific to your field. Real estate has "motivated seller" and "passive income"; SaaS has "synergize" and "low-hanging fruit." These are the phrases that mark content as generic within your niche even when they are not generic AI tells.
Brand-specific bans — terms you personally want gone: competitor names, regulated phrases you are not cleared to use, internal jargon that should never reach the public, claims your legal team has flagged. These are unique to your workspace and no library will contain them.

Banned-list category	Where it comes from	Example entries	How it changes over time
Universal AI tells	Standard AI-tells library, shared across all workspaces	leverage, delve, unlock, in today's fast-paced world, a testament to	Rotates slowly as models change and new tics emerge
Industry cliches	Your field's overused jargon	motivated seller (RE), synergize (SaaS), mindset shift (coaching)	Revised most often — field cliches age out fastest
Brand-specific bans	Your workspace only — no library has these	competitor names, regulated phrases, internal jargon, legal-flagged claims	Grows as legal and brand decisions accumulate

The three categories of the Persona Brief banned-word list and how each evolves. The gate enforces all three identically; they differ only in source and refresh cadence.
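
One way to picture the three categories as configuration is a small data structure that merges them into the single list the gate enforces. The class and field names below are illustrative assumptions, not a real Persona Brief schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BannedWordList:
    # Field names are hypothetical; they mirror the three categories in the table.
    universal_ai_tells: List[str] = field(default_factory=list)
    industry_cliches: List[str] = field(default_factory=list)
    brand_specific: List[str] = field(default_factory=list)

    def all_phrases(self) -> List[str]:
        """Merge the categories into one lowercased, de-duplicated list."""
        seen = set()
        merged = []
        combined = self.universal_ai_tells + self.industry_cliches + self.brand_specific
        for phrase in combined:
            key = " ".join(phrase.lower().split())  # normalize case and spacing
            if key and key not in seen:
                seen.add(key)
                merged.append(key)
        return merged


brief_bans = BannedWordList(
    universal_ai_tells=["leverage", "delve", "a testament to"],
    industry_cliches=["synergize", "Low-hanging fruit"],
    brand_specific=["Leverage", "internal codename"],
)
print(brief_bans.all_phrases())
```

Keeping the categories separate in storage but merged at enforcement time matches the point above: the gate treats all three identically, and only their source and refresh cadence differ.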

A starter list of fifty to eighty phrases from the universal library plus twenty to thirty industry additions is enough to flip autopilot on. The list grows from there through the monthly tuning loop, typically settling around 150-250 phrases after six months of refinement. The point is not to ship the perfect list on day one — it is to ship a good-enough list and let the editing you do during the [manual ramp](/autonomous/manual-vs-autopilot-ramp) reveal the rest.

Regex matching strategy: catching the word, not the substring

The single most common way a banned-word filter goes wrong is substring matching. Ban "tech" with a naive substring match and you also ban "technology," "technique," and "biotech": the gate rejects clean output over a fragment, the regeneration loop burns three attempts, and the output lands in review for no reason. The brand-safety gate avoids this with word-boundary regex, which matches whole words and phrases rather than any sequence of characters that happens to contain the banned string.
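
The difference is easy to see with Python's standard `re` module. This is a small demonstration of the two matching styles, not the gate's actual code; the sample sentence is made up.

```python
import re

text = "Our technology roadmap covers biotech and fintech partners."

# Naive substring check: flags the text even though "tech" never appears as a word.
naive_hit = "tech" in text.lower()

# Word-boundary check: \b requires a non-word character (or the string edge)
# on both sides of the banned word.
boundary_hit = re.search(r"\btech\b", text, flags=re.IGNORECASE) is not None

print(naive_hit, boundary_hit)  # True False
```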

What word-boundary matching does and does not catch:

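Banned phrase "leverage" matches: "We leverage AI", "Leveraging the platform", "fully leveraged". It catches the word and its inflections, each as a whole word. A bare boundary around the literal entry would miss "leveraging," so the compiled pattern has to allow for inflected endings.
Does NOT match: "Levered buyout", "Leverington Street". These words share only the opening letters "lever" with the banned word and are not forms of it, so the check correctly skips them.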
Banned phrase "dive deep" matches: "Let us dive deep", "diving deep into the data". The phrase is matched as a unit at word boundaries.
Does NOT match: "deep dive". The reverse word order is a different phrase — if you want to ban it too, add "deep dive" as its own entry. The gate matches what you list, not what you meant.

Matching is case-insensitive, which closes the obvious dodge: a model cannot sidestep the ban by capitalizing. "Leverage," "leverage," and "LEVERAGE" all match the same entry. The combination — case-insensitive plus word-boundary — is what makes the gate both thorough (no capitalization escape, all inflections caught) and precise (no substring false positives). When the gate misfires, it is almost always because a phrase was listed that is too short or too generic, not because the matching strategy failed. The fix is to lengthen the phrase or ban the specific construction rather than the fragment.

Banned list entry	Matches	Does NOT match	Why
leverage	leverage, leveraging, leveraged	Levered, Leverington	Inflections caught as whole words; words that only share a prefix are skipped
dive deep	dive deep, diving deep	deep dive, deepest dive	Phrase matched as ordered unit; reverse order needs its own entry
unlock	unlock, unlocking, unlocked	unlocker (rare), padlock	Inflections caught at boundary; unrelated words skipped
tech (too short; avoid)	tech, Tech, TECH	technology, technique, biotech	Word-boundary saves it from substrings, but a 4-letter ban is high-risk and still fires inside hyphenated compounds like high-tech; prefer a longer phrase
game-changer	game-changer, game changer	changer, game	Hyphen and space variants both caught; component words alone are safe

Word-boundary, case-insensitive regex behavior on representative banned-list entries. Short single-word bans are the riskiest entries even with boundary matching — prefer specific multi-word phrases where possible.
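
A pattern that reproduces the behavior in the table needs three things: word boundaries, case-insensitivity, and some allowance for inflections and hyphen/space variants. The sketch below is one possible way to compile an entry. The suffix rule is a deliberately crude assumption (it ignores irregular forms and doubled consonants), and `compile_banned_phrase` is a hypothetical name.

```python
import re
from typing import Pattern


def _word_pattern(word: str) -> str:
    """Regex fragment for one word plus crude English inflections (an assumption)."""
    if len(word) < 4:
        return re.escape(word)  # too short to inflect safely
    if word.endswith("e"):
        # leverage -> leverage, leverages, leveraged, leveraging
        return re.escape(word[:-1]) + r"(?:e|es|ed|ing)"
    # unlock -> unlock, unlocks, unlocked, unlocking
    return re.escape(word) + r"(?:s|es|ed|ing)?"


def compile_banned_phrase(phrase: str) -> Pattern[str]:
    """Compile one banned-list entry into a case-insensitive, word-bounded regex."""
    words = [w for w in re.split(r"[\s-]+", phrase.strip().lower()) if w]
    if not words:
        raise ValueError("banned phrase must contain at least one word")
    # Words may be separated by spaces or hyphens, so "game-changer" and
    # "game changer" compile to the same pattern. Word order is preserved.
    body = r"[\s-]+".join(_word_pattern(w) for w in words)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)


pattern = compile_banned_phrase("dive deep")
print(bool(pattern.search("We are diving deep into the data")))  # True
print(bool(pattern.search("A deep dive into the data")))          # False
```

Because word order is preserved, "deep dive" still needs its own entry, exactly as the table describes.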

Regeneration prompt engineering

When the gate rejects an output, the quality of the rewrite depends entirely on what the regeneration prompt tells the model. A generic "do not use leverage" is weak — it is the same prompt-level instruction that already failed, with no extra information. The gate does better by handing the model three things: the exact phrase that was caught, the meaning of the sentence it appeared in, and a direction to replace it with concrete language. A regeneration prompt that works looks like this:

“Your previous output contained the banned phrase "leverage." Regenerate the post without that phrase or any synonym for it. The original sentence meant: [paraphrase of the surrounding sentence]. Replace it with concrete, specific language rather than another abstract verb.”

— Brand-safety gate regeneration prompt, Autopilot gating layer

Three elements make this rewrite reliably better than a bare instruction. Naming the exact phrase removes ambiguity about what failed. Supplying the original meaning prevents the model from dropping the idea entirely or mangling the sentence to route around one word. And the "concrete, specific language" direction steers the model away from the synonym-substitution trap — if you do not name a target, the model will often swap one AI tell for another. The regeneration prompt is doing real engineering work; it is not just a retry.
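
As a rough sketch, the three elements map onto a small prompt builder. The function names are hypothetical, and the sentence splitter is naive: it only locates the sentence around the match so that whatever produces the paraphrase has something to work from.

```python
def surrounding_sentence(text: str, start: int, end: int) -> str:
    """Return the sentence that contains text[start:end], split naively on . ! ?"""
    left = max(text.rfind(ch, 0, start) for ch in ".!?") + 1
    later_stops = [i for i in (text.find(ch, end) for ch in ".!?") if i != -1]
    right = min(later_stops) + 1 if later_stops else len(text)
    return text[left:right].strip()


def build_regeneration_prompt(banned_phrase: str, sentence_meaning: str) -> str:
    """Assemble the regeneration prompt from the caught phrase and the sentence's meaning.

    sentence_meaning should be a paraphrase of the original sentence; producing
    that paraphrase is outside the scope of this sketch.
    """
    meaning = sentence_meaning.rstrip(".!? ")
    return (
        f'Your previous output contained the banned phrase "{banned_phrase}." '
        "Regenerate the post without that phrase or any synonym for it. "
        f"The original sentence meant: {meaning}. "
        "Replace it with concrete, specific language rather than another abstract verb."
    )


draft = "We ship fast. We leverage AI to grow. Join us."
start = draft.index("leverage")
sentence = surrounding_sentence(draft, start, start + len("leverage"))
print(sentence)  # We leverage AI to grow.
print(build_regeneration_prompt("leverage", "we use AI to grow the business"))
```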

There is one failure case the regeneration prompt cannot fix on its own: a banned phrase that the Persona Brief also requires in its "required structures" section. The model is told to use a phrase and forbidden from using it at the same time, so every regeneration fails and the output exhausts its three attempts. The gate handles this by routing to review rather than looping forever, but the real fix is upstream — audit the brief for bans that conflict with requirements before they cost you three regenerations per output.
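
That upstream audit can be automated with a simple cross-check. The sketch below assumes the brief exposes its banned phrases and required structures as plain lists of strings; the function name and the sample data are hypothetical.

```python
import re
from typing import Dict, List


def find_brief_conflicts(
    banned: List[str], required_structures: List[str]
) -> Dict[str, List[str]]:
    """Map each banned phrase to the required structures that contain it."""
    conflicts: Dict[str, List[str]] = {}
    for phrase in banned:
        # Literal, word-bounded, case-insensitive match of the listed phrase.
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        hits = [req for req in required_structures if pattern.search(req)]
        if hits:
            conflicts[phrase] = hits
    return conflicts


banned = ["unlock", "breakthrough"]
required = [
    "End every post with: Ready to unlock your next step?",
    "Open with a question",
]
print(find_brief_conflicts(banned, required))
```

Running a check like this whenever the brief changes surfaces the conflict before any output has to exhaust its three attempts.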

Industry-specific banned-word patterns

The universal AI-tells library is the same for everyone, but the industry layer is where most workspaces win or lose on voice. The cliches that mark content as generic are field-specific, and a model trained on the internet will reach for your industry's most overused phrases unless you ban them explicitly. Starting points by vertical — extend each with the phrases you personally keep cutting:

SaaS / B2B tech

synergize
synergistic
circle back
low-hanging fruit
value-add
cross-functional
best-in-class
move the needle
mission-critical

Real estate

motivated seller
investment opportunity
wealth-building
passive income (when overused)
cash-flowing asset
build generational wealth

Coaching / consulting

mindset shift
limiting beliefs
next-level mindset
breakthrough
transformative journey
unlock your potential
show up as your best self

Health / wellness

wellness journey
holistic approach
natural solution
gentle yet effective
nourish your body
self-care ritual

Two cautions on the industry layer. First, some of these phrases are only bad when overused — "passive income" is a legitimate real-estate term and a cliche when it shows up in every post. Ban the ones you are tired of seeing, not the entire vocabulary of your field. Second, the industry list is the part you should expect to revise most often, because field cliches rotate; the phrase everyone overused last year reads as dated this year, and a fresh AI tell takes its place. The monthly tuning loop is where that revision happens.

What the gate catches and what it does not

The brand-safety gate is deterministic about exactly one thing: the presence of a listed phrase. That makes it excellent at its job and useless outside it, and being honest about the boundary keeps you from over-trusting it. What it reliably catches:

AI tells that slipped past the prompt — the one-in-five case where the model used a banned phrase despite being told not to.
Synonym-substitution AI tells, as long as the synonym is itself on the list. Ban both "dive deep" and "explore in depth" and the model has nowhere to hide.
Industry cliches you have explicitly banned, every time, not most of the time.
Brand-conflict words — competitor names, off-limits claims, internal jargon — wherever they appear in the output.

What it cannot catch, because none of these is a listed-phrase problem:

Tone misalignment the Persona Brief did not name. Output that uses zero banned words but lands in the wrong register passes the gate untouched.
A new AI tell you have not added to the list yet. The gate enforces the list; it does not discover phrases on its own. That is what the monthly audit is for.
Misleading framing or weak strategy. The gate checks words, not arguments. A post can be perfectly clean and still make the wrong point.
Subjective overclaim. "Best in class" can be banned as a phrase, but the gate has no way to judge whether a claim is too strong in context.

The brand-safety gate is necessary, not sufficient. Paired with the [fact-anchor gate](/autonomous/fact-anchor-gate) ahead of it, the two deterministic gates catch the large majority of bad outputs before publishing — invented stats and fabricated quotes on the fact-anchor side, AI tells and banned phrases on the brand-safety side. What remains is the judgment layer: tone, framing, strategic fit. That is why even fully autonomous workspaces review aggregate metrics weekly. The gates handle the deterministic failures; a human handles the judgment ones.

Common gate failures and how to fix them

Most brand-safety gate problems are configuration problems, not gate problems. The four that come up most often, and the fix for each:

Over-banning. Too many phrases on the list drives excessive regeneration — every output trips something, the loop burns attempts, and outputs pile up in review. Watch for a rejection rate above 25% post-ramp; if you see it, the list is too aggressive for your source material, not the gate misbehaving.
Under-banning. The opposite failure: a thin list lets generic output ship clean. The tell is that you keep editing the same phrases out of published posts by hand. Anything you repeatedly delete belongs on the list — that is the monthly audit's job.
Conflicting bans. A phrase banned in one section and required in another (a "required structure" the brief also forbids) creates a regeneration loop that always fails. Audit for these directly; they are invisible until an output keeps exhausting its three attempts.
Substring false positives. A too-short or too-generic entry, such as banning "tech" or "art," risks catching clean text even with word-boundary matching, because a hyphen counts as a word boundary: "tech" still fires inside "high-tech" and "art" inside "state-of-the-art." Prefer specific multi-word phrases; lengthen any single-word ban that misfires.
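
The 25% over-banning signal is simple arithmetic once you have a pass/fail record per output. The sketch below assumes "rejection rate" means the share of outputs the gate rejected on their first pass; that definition, the names, and the sample numbers are assumptions.

```python
from typing import List

OVER_BANNING_THRESHOLD = 0.25  # the 25% post-ramp guideline from the list above


def rejection_rate(first_pass_results: List[bool]) -> float:
    """Share of outputs rejected on their first pass (True means passed)."""
    if not first_pass_results:
        return 0.0
    rejected = sum(1 for passed in first_pass_results if not passed)
    return rejected / len(first_pass_results)


def is_over_banning(first_pass_results: List[bool]) -> bool:
    return rejection_rate(first_pass_results) > OVER_BANNING_THRESHOLD


# 30 outputs, 9 of them rejected on the first pass: 9 / 30 = 0.30, above the threshold.
results = [False] * 9 + [True] * 21
print(rejection_rate(results), is_over_banning(results))  # 0.3 True
```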

Tuning the gate over time

The banned-word list is not a set-and-forget configuration. It is a living document that should converge over the first few months and then rotate slowly forever, because AI tells and industry cliches both drift. A monthly audit keeps it sharp without letting it bloat into the over-banning failure mode. The checklist:

Review the last thirty shipped outputs. What did you edit before or after publishing? Every phrase you keep cutting by hand belongs on the list — your edits are the highest-quality source of new bans you have.
Review the last thirty rejected outputs. What patterns show up? If the same phrase keeps triggering regeneration and the rewrite keeps landing on an unlisted variation or synonym, escalate it: add those variations and synonyms so the model has nowhere to route.
Check the rejection-rate trend. It should decline over time as the list stabilizes. If it is rising, something changed (a Persona Brief update, a new source type, a new format), and the cause is worth tracking down rather than tolerating.
Rotate stale entries every six months. Some banned cliches age out of relevance; new AI tells emerge as models change. Prune the phrases that no longer appear and add the ones that started showing up. The list should evolve, not just grow.
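
Part of this checklist can be driven from a rejection log. The sketch below assumes a hypothetical log format with one entry per rejection, holding the listed phrase that triggered it. It reports the most frequent offenders and the entries that never fired in the window. Treat the second list as pruning candidates only; a thirty-output window is too short to retire an entry on its own.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def audit_rejections(
    rejection_log: Iterable[str], banned: List[str], top_n: int = 5
) -> Tuple[List[Tuple[str, int]], List[str]]:
    """Return (most frequent offending phrases, listed phrases with no hits)."""
    counts = Counter(phrase.lower() for phrase in rejection_log)
    top_offenders = counts.most_common(top_n)
    no_hits = [phrase for phrase in banned if counts[phrase.lower()] == 0]
    return top_offenders, no_hits


log = ["leverage", "unlock", "Leverage", "delve", "leverage"]
banned = ["leverage", "unlock", "delve", "synergize"]
top_offenders, no_hits = audit_rejections(log, banned)
print(top_offenders[0])  # ('leverage', 3)
print(no_hits)           # ['synergize']
```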

One tuning principle overrides the rest: ban with a replacement in mind, not just a removal. Output sounds stilted when you strip words without giving the model an alternative direction — that is the over-banning failure in a different guise. The right move is to ban the phrase and let the regeneration prompt steer toward concrete language, so the result reads cleaner rather than emptier. A well-tuned brand-safety gate makes content sound more like you; a carelessly tuned one makes it sound like someone reading with a thesaurus held to their head. The difference is whether the bans came with replacements.

For where this gate sits in the full ordering, and how it hands off from the fact-anchor gate before it, see the [quality-gates](/autonomous/quality-gates) architecture overview. For how the banned-word list gets built in the first place — through the editing you do while still in the approval loop — see the [manual-vs-autopilot-ramp](/autonomous/manual-vs-autopilot-ramp) methodology. And for what the gate protects when you are fanning one source out across platforms, see [content repurposing](/repurpose). Kompozy ships the four-gate stack on its Creator and Pro tiers; see [pricing](/pricing).

Frequently asked questions

Why is an output-time gate better than just listing banned words in the prompt?

Because prompt instructions are probabilistic and the gate is deterministic. A model overrides prompt-level banned-word rules about one time in five — through training-data bias, context drift, or swapping in an equally bad synonym. On a manual workflow a human catches that fifth case; on autopilot there is no human in the per-output loop, so an output-time regex check is the only thing that reliably catches what the prompt failed to prevent.

Where does the gate get its banned-word list?

From the banned-word section of your Persona Brief — the same brief that supplies your voice DNA and reference posts. The gate compiles that list into a regex set and enforces it on every output. This is why the banned-word section is the highest-leverage field in the brief: a thin list means a weak gate, and a well-built list is what makes autopilot output sound like you instead of like a language model.

Will banning a lot of words make my content sound stilted?

Only if you ban without replacing. Stilted output comes from stripping words and leaving a gap. The gate avoids this by feeding the regeneration prompt the original meaning and a direction toward concrete language, so the rewrite reads cleaner rather than emptier. Ban the phrase and steer the replacement, and a long list improves voice instead of flattening it.

How many banned words should the list contain?

A starter list of 50-80 phrases from the universal AI-tells library plus 20-30 industry-specific additions is enough to enable autopilot. Mature lists settle around 150-250 phrases after about six months of monthly tuning. The exact number matters less than the source: the phrases you keep editing out of real outputs are the ones that belong on the list.

How does the regex avoid false positives like matching "tech" inside "technology"?

It uses word-boundary matching, so it catches whole words and phrases rather than any character sequence containing the banned string. "leverage" matches "leveraging" but not "Leverington"; "dive deep" matches "diving deep" but not "deep dive" (reverse order needs its own entry). Matching is also case-insensitive so capitalization cannot dodge it. Short single-word bans are still the riskiest entries — prefer specific multi-word phrases where you can.

What happens when an output keeps failing the gate?

It regenerates up to three times, each attempt re-checked against the full list because a rewrite can introduce a different banned phrase than it removed. If it still contains a banned phrase after three attempts, it routes to manual review with the persistent phrase highlighted rather than looping forever or shipping dirty. A phrase that survives three regenerations usually signals a Persona Brief conflict — a word banned in one section and required in another — which is worth auditing directly.

Does the brand-safety gate slow down generation?

Barely. It is a regex pass over the finished text, not a model call, so it adds well under a tenth of a second per output on a typical list. The fact-anchor gate that runs just before it — which matches claims against source material — is the slow gate; brand-safety is effectively free and can run on every output without anyone noticing the latency.

Can the gate use a moderation API instead of regex?

Some implementations do for safety-critical bans like hate speech or harmful content, where a moderation classifier adds value. For brand-style bans (AI tells, industry cliches, competitor names), regex is faster, fully controllable, and deterministic, which is exactly what you want for workspace-specific rules. The common pattern is regex for the brand-style list and a moderation API layered on only for the safety-critical category. Verify any specific moderation API before relying on it for compliance-grade filtering.
