Brand-safety gate: banned-word filtering for autonomous AI output
How a banned-word list applied at output-time (not just prompt-time) catches the AI tells that flag your content as AI. The architecture and tuning guide.
The direct answer
The brand-safety gate is a deterministic, regex-based filter that checks every generated output against the banned-word list in your Persona Brief. If the output contains any banned phrase, it is rejected and regenerated. This catches the roughly 20% of cases where the model ignores prompt-level instructions and lets a banned phrase through. After 3 failed regenerations, the output routes to manual review.
Base models override prompt instructions surprisingly often. Telling the model "never use the word leverage" works ~80% of the time. The brand-safety gate catches the 20% where it slips through — which is the difference between autopilot you can trust and autopilot you cannot.
This post covers the architecture, the tuning process, and the failure modes.
Why prompt-level instructions are not enough
Three reasons base models violate banned-word instructions:
Training-data bias. Models have seen "leverage" in millions of marketing documents. The pull toward common phrases overrides specific instructions.
Context drift. As the prompt context grows (Persona Brief + source + generation instructions), specific banned-word rules get diluted relative to overall context weight.
Synonym substitution. Models sometimes substitute banned words with near-synonyms that are equally bad ("dive deep" becomes "explore in depth").
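A deterministic output-time check catches the first two regardless of why the phrase appeared. It catches synonym substitution only when the substitute is itself on the list, so add near-synonyms as you find them. Prompt instructions alone catch none of the three reliably.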
How the brand-safety gate works
The banned-word list in the Persona Brief is parsed into a set of regexes (case-insensitive, with word boundaries so that partial matches inside longer words do not count).
After generation, every output is run through the regex set.
Any match flags the output for regeneration. The flagged word is included in the regeneration prompt: "Output contained banned phrase: [phrase]. Regenerate without it."
Regeneration runs up to 3 times. If still failing after 3 attempts, output routes to manual review with the persistent banned phrase highlighted.
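The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the product's actual implementation: `generate` is a hypothetical stand-in for the model call, and the returned dictionary shape is an assumption made for the example.

```python
import re
from typing import Callable, Optional

MAX_REGENERATIONS = 3  # from the post: up to 3 regenerations, then manual review


def compile_banned(phrases: list[str]) -> list[tuple[str, re.Pattern[str]]]:
    """Turn each banned phrase into a case-insensitive, word-boundary regex."""
    return [
        (p, re.compile(r"\b" + re.escape(p) + r"\b", re.IGNORECASE))
        for p in phrases
    ]


def find_banned(text: str, patterns: list[tuple[str, re.Pattern[str]]]) -> list[str]:
    """Return every banned phrase that appears in the text."""
    return [phrase for phrase, rx in patterns if rx.search(text)]


def run_gate(generate: Callable[[Optional[str]], str], phrases: list[str]) -> dict:
    """Check an output, regenerate on a match, give up after MAX_REGENERATIONS.

    generate(feedback) is a hypothetical model call; feedback is None on the
    first attempt and a regeneration instruction afterwards.
    """
    patterns = compile_banned(phrases)
    output = generate(None)
    for attempt in range(MAX_REGENERATIONS + 1):
        hits = find_banned(output, patterns)
        if not hits:
            return {"status": "approved", "output": output, "regenerations": attempt}
        if attempt == MAX_REGENERATIONS:
            break  # third regeneration still failed
        feedback = f"Output contained banned phrase: {hits[0]}. Regenerate without it."
        output = generate(feedback)
    # Still failing: hand over to a human with the persistent phrases attached.
    return {"status": "manual_review", "output": output, "flagged": hits}
```

Because the check is plain string matching, the same input always produces the same verdict, which is what makes the gate auditable in a way that prompt instructions are not.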
What to include in the banned-word list
Three categories:
AI tells (the universal list of 120+ phrases — see /brand-voice/banned-words for the full library)
Industry-specific cliches (your industry-specific jargon and overused phrases)
Brand-specific bans (terms you specifically want to avoid — competitor names, regulated phrases, internal jargon)
Regex matching strategy
The gate uses word-boundary regex matching to avoid false positives:
Banned phrase "leverage" does NOT match: "Levered buyout" ✗, "Leverington Street" ✗
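Banned phrase "dive deep" matches: "Let us dive deep", "diving deep into" (the inflected form matches only if the entry also covers "diving deep"; a literal word-boundary match on "dive deep" alone would miss it)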
Does NOT match: "deep dive" (intentional — the reverse order is allowed; ban "deep dive" separately if needed)
Case-insensitive matching prevents the model from sidestepping by capitalizing. "Leverage" and "LEVERAGE" both match.
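A short Python sketch of this matching strategy follows. It assumes each list entry can carry hand-written variants (such as "diving deep" alongside "dive deep"); the `phrase_pattern` helper is a hypothetical name for illustration, not part of any documented API.

```python
import re


def phrase_pattern(*variants: str) -> re.Pattern[str]:
    """Build one case-insensitive, word-boundary pattern from a phrase and its listed variants."""
    # Longest first, so a longer variant is tried before a shorter one it starts with.
    ordered = sorted(variants, key=len, reverse=True)
    alternation = "|".join(re.escape(v) for v in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


leverage = phrase_pattern("leverage")
dive_deep = phrase_pattern("dive deep", "diving deep")

# Case-insensitive: capitalization does not sidestep the ban.
assert leverage.search("We LEVERAGE our data")
# Word boundaries: longer words that merely start similarly are left alone.
assert not leverage.search("Levered buyout")
assert not leverage.search("Leverington Street")
# Listed variants match; the reverse word order does not.
assert dive_deep.search("diving deep into")
assert not dive_deep.search("a deep dive")
```

Note that a strict word-boundary pattern for "leverage" also skips "leveraged" and "leveraging". If those should be banned too, list them as variants rather than loosening the boundary.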
Regeneration prompt engineering
When the gate triggers regeneration, the prompt to the model is specific:
Your previous output contained the banned phrase "leverage." Regenerate the post without using that phrase or any synonyms. The original meaning was: [paraphrase of the surrounding sentence]. Replace with concrete language that does not rely on the banned phrase.
This works better than a generic "do not use leverage" instruction because it gives the model context for what the original sentence meant and a target replacement direction.
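As a rough sketch, the regeneration prompt can be assembled from the matched phrase and the sentence it appeared in. Both helper names are hypothetical, and the post does not say how the paraphrase is produced, so here the caller supplies it.

```python
import re
from typing import Optional


def sentence_containing(text: str, pattern: re.Pattern[str]) -> Optional[str]:
    """Return the first sentence in which the banned pattern appears, if any."""
    # Naive split on sentence-ending punctuation; good enough for a sketch.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if pattern.search(sentence):
            return sentence
    return None


def build_regeneration_prompt(phrase: str, meaning: str) -> str:
    """Fill the regeneration template with the banned phrase and the intended meaning."""
    return (
        f'Your previous output contained the banned phrase "{phrase}". '
        "Regenerate the post without using that phrase or any synonyms. "
        f"The original meaning was: {meaning}. "
        "Replace with concrete language that does not rely on the banned phrase."
    )
```

A caller would locate the offending sentence with `sentence_containing`, paraphrase it (by hand or with a separate model call), and pass the paraphrase in as `meaning`. The paraphrase should avoid the banned phrase itself so the prompt does not echo it back to the model.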
Industry-specific banned-word patterns
SaaS / B2B tech
synergize
synergistic
circle back
low-hanging fruit
value-add
cross-functional
agile-first
data-driven (when overused)
Real estate
motivated seller
investment opportunity
wealth-building
passive income (when overused)
cash-flowing asset
Coaching / consulting
mindset shift
limiting beliefs
next-level mindset
breakthrough
transformative journey
unlock your potential
Health / wellness
wellness journey
holistic approach
natural solution
gentle yet effective
Common gate failures
Over-banning. Including too many phrases causes excessive regeneration. Watch for a rejection rate above 25% once the 14-day ramp is over.
Under-banning. Missing common AI tells lets generic output ship. Audit edited outputs monthly — words you keep deleting belong in the list.
Conflicting bans. Banning a phrase that the Persona Brief also requires in "required structures" means no output can ever pass: every attempt burns all 3 regenerations and lands in manual review. Audit for conflicts.
Substring false positives. Banning "tech" with plain substring matching also flags "technology", "biotech", "fintech", etc. Use word-boundary regex or longer phrase matching.
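Conflicting bans are easy to detect mechanically. The sketch below assumes the Persona Brief exposes its required structures as a list of strings, which is an assumption about the data shape; `find_conflicts` is a hypothetical helper name.

```python
import re


def find_conflicts(banned: list[str], required: list[str]) -> list[tuple[str, str]]:
    """Return (banned phrase, required structure) pairs that can never both be satisfied."""
    conflicts = []
    for phrase in banned:
        # Same matching rules as the gate: case-insensitive, whole words only.
        rx = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        for item in required:
            if rx.search(item):
                conflicts.append((phrase, item))
    return conflicts


banned = ["breakthrough", "tech"]
required = ["Open with a breakthrough moment", "Mention the technology stack"]
conflicts = find_conflicts(banned, required)
```

With these sample inputs only "breakthrough" is reported. "tech" is not, because the word boundary keeps it from matching inside "technology", which also illustrates the substring point above.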
Tuning the gate over time
Monthly audit checklist:
Review the last 30 shipped outputs. What edits did you make? Phrases you keep cutting belong in the banned list.
Review the last 30 rejected outputs. What patterns? If the same phrase keeps triggering regeneration but eventually slips through, escalate strictness or add specific phrase variations.
Check the rejection rate trend. It should decline over time as the Persona Brief and prompts are tuned. If it is increasing, something changed (Persona Brief update? New source type?). Investigate.
Rotate the banned-word list every 6 months. Some phrases become outdated; new AI tells emerge. The list should evolve.
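Part of this audit can be computed from a gate log. The sketch below assumes a hypothetical log where each record has a `flagged` list of the phrases that triggered regeneration (empty if the first output passed), and it defines rejection rate as the share of outputs that triggered at least one regeneration. The post does not specify either, so treat both as assumptions.

```python
from collections import Counter

REJECTION_ALERT = 0.25  # the post's warning threshold for rejection rate after the ramp


def audit(log: list[dict]) -> dict:
    """Summarize a gate log: rejection rate and the phrases that trigger most often."""
    total = len(log)
    rejected = sum(1 for record in log if record["flagged"])
    counts = Counter(phrase for record in log for phrase in record["flagged"])
    rate = rejected / total if total else 0.0
    return {
        "rejection_rate": rate,
        "over_threshold": rate > REJECTION_ALERT,
        "top_phrases": counts.most_common(5),
    }
```

Comparing `top_phrases` month over month shows which bans do the most work and which entries never fire and could be candidates for trimming.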
Integration with the fact-anchor gate
Brand-safety runs after fact-anchor. The order matters:
Brand-safety: catches AI tells, banned phrases, brand-conflict words
Together: ~80% of bad outputs caught deterministically before publishing
What remains is the ~20% of failures that require editorial judgment — tone, framing, strategic alignment. Those still need human review on high-stakes content, but the gates handle the bulk.
Frequently asked questions
How many banned words should the list contain?
Mature lists have 150-250 phrases after 6 months of refinement. Starter list is 50-80 phrases from the universal AI-tells library plus 20-30 industry-specific additions.
Will banning many words make output sound stilted?
Only if you ban without replacing. The right approach is to ban the phrase and have the regeneration prompt suggest a replacement direction. Stilted output comes from removing words without giving the model an alternative path.
Can the gate use a denylist API instead of regex?
Some implementations use hosted moderation APIs (for example, OpenAI's moderation endpoint). These add latency and are built for safety categories, not brand style. Regex is faster and more controllable for brand-specific bans. Use APIs for safety-critical bans (hate speech, harmful content); use regex for brand-style bans.
Does the gate work for non-English content?
Yes, with separate banned-word lists per language. Each language has its own AI tells. A multilingual workspace needs multiple lists, one per language.
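One practical wrinkle, sketched below in Python: word-boundary matching relies on transitions between word and non-word characters, so it does not behave as intended in languages written without spaces between words. The language codes, the sample phrases, and the `NO_WORD_SPACING` set are all illustrative assumptions, not part of the product.

```python
import re

# Hypothetical: languages where \b cannot isolate a phrase inside running text.
NO_WORD_SPACING = {"ja", "zh"}

# Hypothetical per-language lists; real lists would come from each Persona Brief.
BANNED_BY_LANGUAGE = {
    "en": ["leverage"],
    "de": ["ganzheitlicher Ansatz"],
    "ja": ["革新的"],
}


def compile_phrase(phrase: str, lang: str) -> re.Pattern[str]:
    """Use word boundaries where the script has them, plain substring matching otherwise."""
    escaped = re.escape(phrase)
    if lang in NO_WORD_SPACING:
        return re.compile(escaped, re.IGNORECASE)
    return re.compile(r"\b" + escaped + r"\b", re.IGNORECASE)


def patterns_for(lang: str) -> list[re.Pattern[str]]:
    """Compile only the list that belongs to the given language."""
    return [compile_phrase(p, lang) for p in BANNED_BY_LANGUAGE.get(lang, [])]
```

Substring matching brings back the false-positive risk described earlier, so lists for these languages are better served by longer, more specific phrases.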
How aggressive should I be with the list during the ramp?
Very aggressive. Over-banning during the 14-day ramp is fine — the rejection rate is high anyway, and you are learning what triggers actual problems. Trim down once the Persona Brief stabilizes.