Brand-safety gate: banned-word filtering for autonomous AI output

How a banned-word list applied at output time, not just prompt time, catches the AI tells that mark your content as machine-written. An architecture and tuning guide.

The direct answer

The brand-safety gate is a deterministic, regex-based filter that checks every generated output against the banned-word list in your Persona Brief. If the output contains any banned phrase, it is rejected and regeneration is triggered. This catches the roughly 20% of outputs where the model overrides prompt-level instructions and lets a banned phrase through. If the output still fails after 3 regenerations, it routes to manual review.

Base models override prompt instructions surprisingly often. Telling the model "never use the word leverage" works ~80% of the time. The brand-safety gate catches the 20% where it slips through — which is the difference between autopilot you can trust and autopilot you cannot.

This post covers the architecture, the tuning process, and the failure modes.

Why prompt-level instructions are not enough

Three reasons base models violate banned-word instructions:

  1. Training-data bias. Models have seen "leverage" in millions of marketing documents. The pull toward common phrases overrides specific instructions.
  2. Context drift. As the prompt context grows (Persona Brief + source + generation instructions), specific banned-word rules get diluted relative to overall context weight.
  3. Synonym substitution. Models sometimes substitute banned words with near-synonyms that are equally bad ("dive deep" becomes "explore in depth").

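A deterministic output-time check catches the first two no matter why the phrase appeared. It catches the third once the offending synonyms are added to the list. Prompt instructions alone cannot guarantee any of them.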

How the brand-safety gate works

  1. Persona Brief banned-word list is parsed into a regex set (case-insensitive, word-boundary matched to avoid false positives on partial matches).
  2. After generation, every output is run through the regex set.
  3. Any match flags the output for regeneration. The flagged phrase is included in the regeneration prompt, in its shortest form: "Output contained banned phrase: [phrase]. Regenerate without it."
  4. Regeneration runs up to 3 times. If still failing after 3 attempts, output routes to manual review with the persistent banned phrase highlighted.
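
The four steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not the product's implementation: generate stands in for whatever function calls your model, and compile_banned, first_banned, and run_gate are hypothetical names. The patterns here match exact phrases only.

```python
import re

MAX_REGENERATIONS = 3  # from the text: up to 3 regenerations, then manual review


def compile_banned(phrases):
    """Turn each banned phrase into a case-insensitive, word-boundary regex."""
    return [
        (p, re.compile(r"\b" + re.escape(p) + r"\b", re.IGNORECASE))
        for p in phrases
    ]


def first_banned(text, patterns):
    """Return the first banned phrase found in text, or None if the text is clean."""
    for phrase, pattern in patterns:
        if pattern.search(text):
            return phrase
    return None


def run_gate(generate, prompt, phrases):
    """Generate, check, and regenerate up to MAX_REGENERATIONS times.

    generate is a hypothetical callable: prompt string in, output string out.
    """
    patterns = compile_banned(phrases)
    output = generate(prompt)
    for attempt in range(MAX_REGENERATIONS + 1):
        hit = first_banned(output, patterns)
        if hit is None:
            return {"status": "approved", "output": output, "regenerations": attempt}
        if attempt == MAX_REGENERATIONS:
            break  # retries exhausted; fall through to manual review
        retry = f"{prompt}\n\nOutput contained banned phrase: {hit}. Regenerate without it."
        output = generate(retry)
    return {"status": "manual_review", "output": output, "banned_phrase": hit}
```

The manual_review result carries the phrase that persisted, so a reviewer sees what the model could not avoid.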

What to include in the banned-word list

Three categories:

  1. AI tells (the universal list of 120+ phrases — see /brand-voice/banned-words for the full library)
  2. Industry-specific cliches (your industry-specific jargon and overused phrases)
  3. Brand-specific bans (terms you specifically want to avoid — competitor names, regulated phrases, internal jargon)

Regex matching strategy

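The gate uses word-boundary regex matching to avoid false positives. The examples below assume each pattern also covers the phrase's inflected forms (leverage, leverages, leveraged, leveraging); a pattern for the bare word "leverage" alone would not match "Leveraging".

  • Banned phrase "leverage" matches: "We leverage AI" ✓, "Leveraging the platform" ✓
  • Does NOT match: "Levered buyout" ✗, "Leverington Street" ✗
  • Banned phrase "dive deep" matches: "Let us dive deep" ✓, "diving deep into" ✓
  • Does NOT match: "deep dive" ✗ (intentional: the reverse order is allowed; ban "deep dive" separately if needed)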

Case-insensitive matching prevents the model from sidestepping by capitalizing. "Leverage" and "LEVERAGE" both match.
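
One way to get the matches listed above is to build each pattern with a small inflection rule. The sketch below is an assumption about how that could work, not a description of the gate's actual patterns: phrase_to_pattern is a hypothetical helper, and the suffix rule is deliberately crude (it would need extending for irregular words).

```python
import re


def phrase_to_pattern(phrase):
    """Build a case-insensitive, word-boundary regex for a banned phrase.

    Assumed rule: each word may also appear with a simple verb ending,
    so "leverage" covers "leveraging" and "dive deep" covers "diving deep".
    """
    parts = []
    for word in phrase.split():
        if word.endswith("e"):
            # leverage -> leverag + e / es / ed / ing
            parts.append(re.escape(word[:-1]) + r"(?:e|es|ed|ing)")
        else:
            # deep -> deep, plus optional s / es / ed / ing
            parts.append(re.escape(word) + r"(?:s|es|ed|ing)?")
    # Words must appear in order, separated by whitespace.
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)


leverage = phrase_to_pattern("leverage")
dive_deep = phrase_to_pattern("dive deep")

for sample in ["We leverage AI", "Leveraging the platform",
               "Levered buyout", "Leverington Street"]:
    print(sample, "->", bool(leverage.search(sample)))
```

Because the words must appear in order, "deep dive" stays allowed unless it gets its own entry.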

Regeneration prompt engineering

When the gate triggers regeneration, the prompt to the model is specific:

Your previous output contained the banned phrase "leverage." Regenerate the post without using that phrase or any synonyms. The original meaning was: [paraphrase of the surrounding sentence]. Replace with concrete language that does not rely on the banned phrase.

This works better than a generic "do not use leverage" instruction because it gives the model context for what the original sentence meant and a target replacement direction.
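
A minimal sketch of filling that template follows. It makes one simplifying assumption: instead of a true paraphrase, it passes along the original sentence that contained the phrase. The names TEMPLATE, surrounding_sentence, and build_regeneration_prompt are hypothetical.

```python
import re

TEMPLATE = (
    'Your previous output contained the banned phrase "{phrase}". '
    "Regenerate the post without using that phrase or any synonyms. "
    "The original meaning was: {meaning} "
    "Replace with concrete language that does not rely on the banned phrase."
)


def surrounding_sentence(text, phrase):
    """Return the first sentence containing the phrase (case-insensitive)."""
    # Split after sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if phrase.lower() in sentence.lower():
            return sentence
    return ""


def build_regeneration_prompt(text, phrase):
    """Fill the template with the banned phrase and the sentence it appeared in.

    Assumption: the raw sentence stands in for the paraphrase the text describes.
    """
    return TEMPLATE.format(phrase=phrase, meaning=surrounding_sentence(text, phrase))


draft = "Our tool is fast. We leverage AI to draft posts. Try it today."
print(build_regeneration_prompt(draft, "leverage"))
```

In a fuller version, a cheap model call could produce the paraphrase, so the banned phrase is not repeated more than necessary in the retry prompt.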

Industry-specific banned-word patterns

SaaS / B2B tech

  • synergize
  • synergistic
  • circle back
  • low-hanging fruit
  • value-add
  • cross-functional
  • agile-first
  • data-driven (when overused)

Real estate

  • motivated seller
  • investment opportunity
  • wealth-building
  • passive income (when overused)
  • cash-flowing asset

Coaching / consulting

  • mindset shift
  • limiting beliefs
  • next-level mindset
  • breakthrough
  • transformative journey
  • unlock your potential

Health / wellness

  • wellness journey
  • holistic approach
  • natural solution
  • gentle yet effective

Common gate failures

  • Over-banning. Including too many phrases causes excessive regeneration. Watch for a rejection rate above 25% after the ramp.
  • Under-banning. Missing common AI tells lets generic output ship. Audit edited outputs monthly: words you keep deleting belong in the list.
  • Conflicting bans. Banning a phrase that the Persona Brief also requires in "required structures" means every output fails the gate, burns all 3 regenerations, and lands in manual review. Audit for conflicts.
  • Substring false positives. Without word boundaries, banning "tech" matches "technology", "fintech", etc. Use word-boundary regex or longer phrase matching.
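
Two of these failures are easy to check for automatically. The sketch below assumes the Persona Brief's required structures are available as a list of strings and that first-pass gate results are logged as booleans; find_conflicts and rejection_rate are hypothetical names.

```python
import re

REJECTION_RATE_LIMIT = 0.25  # from the text: investigate above 25% after the ramp


def find_conflicts(banned, required):
    """Return (banned phrase, required text) pairs where a ban collides
    with something the Persona Brief requires."""
    conflicts = []
    for phrase in banned:
        # Word boundaries keep "tech" from colliding with "technology".
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        for item in required:
            if pattern.search(item):
                conflicts.append((phrase, item))
    return conflicts


def rejection_rate(first_pass_rejected):
    """Share of outputs rejected on their first check (list of booleans)."""
    if not first_pass_rejected:
        return 0.0
    return sum(first_pass_rejected) / len(first_pass_rejected)


banned = ["breakthrough", "tech"]
required = [
    "End each post with one breakthrough the reader can apply",
    "Mention technology trends when relevant",
]
print(find_conflicts(banned, required))
print(rejection_rate([True, False, False, False]) > REJECTION_RATE_LIMIT)
```

Running the conflict check whenever the Persona Brief or the banned list changes is cheaper than discovering the collision in the manual-review queue.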

Tuning the gate over time

Monthly audit checklist:

  1. Review the last 30 shipped outputs. What edits did you make? Phrases you keep cutting belong in the banned list.
  2. Review the last 30 rejected outputs. What patterns? If the same phrase keeps triggering regeneration but eventually slips through, escalate strictness or add specific phrase variations.
  3. Check the rejection rate trend. It should decline over time as the Persona Brief and prompts are tuned to the rules. If it is increasing, something changed (Persona Brief update? New source type?). Investigate.
  4. Rotate the banned-word list every 6 months. Some phrases become outdated; new AI tells emerge. The list should evolve.
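
Step 1 of the checklist can be partly automated. The sketch below assumes you keep each shipped output as a (draft, final) pair of strings; it counts short phrases that appear in the draft but not in the edited version and reports the ones cut repeatedly. The names ngrams and recurring_cuts, and the threshold of 3, are illustrative choices.

```python
import re
from collections import Counter


def ngrams(text, max_n=3):
    """Return the set of 1- to max_n-word phrases in text, lowercased."""
    words = re.findall(r"[a-z0-9'-]+", text.lower())
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    }


def recurring_cuts(pairs, min_count=3):
    """Phrases removed during editing in at least min_count outputs."""
    counts = Counter()
    for draft, final in pairs:
        # Each phrase counts at most once per output.
        counts.update(ngrams(draft) - ngrams(final))
    return [(p, c) for p, c in counts.most_common() if c >= min_count]


pairs = [
    ("This is a game-changer for teams.", "This is a big win for teams."),
    ("A real game-changer in sales.", "A real shift in sales."),
    ("The game-changer you need.", "The tool you need."),
]
print(recurring_cuts(pairs))
```

The output is a candidate list, not a verdict: a human still decides which recurring cuts are genuine tells and which are one-off edits.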

Integration with the fact-anchor gate

Brand-safety runs after fact-anchor. The order matters:

  • Fact-anchor: catches invented stats, fabricated quotes, hallucinated entities
  • Brand-safety: catches AI tells, banned phrases, brand-conflict words
  • Together: ~80% of bad outputs caught deterministically before publishing

What remains is the ~20% of failures that require editorial judgment — tone, framing, strategic alignment. Those still need human review on high-stakes content, but the gates handle the bulk.
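
The ordering can be sketched as a short pipeline of checks that stops at the first failure. Everything about the fact-anchor check here is a placeholder: this post does not describe how that gate works, so the sketch uses a simple stand-in (flagging numbers that do not appear in the source). All function names are hypothetical.

```python
import re


def make_fact_anchor_gate(source):
    """Placeholder fact check (assumption): flag numbers absent from the source."""
    source_numbers = set(re.findall(r"\d+(?:\.\d+)?", source))

    def gate(text):
        for number in re.findall(r"\d+(?:\.\d+)?", text):
            if number not in source_numbers:
                return f"unanchored number: {number}"
        return None

    return gate


def make_brand_safety_gate(banned):
    """Banned-phrase check: case-insensitive, word-boundary matched."""
    patterns = [
        (p, re.compile(r"\b" + re.escape(p) + r"\b", re.IGNORECASE))
        for p in banned
    ]

    def gate(text):
        for phrase, pattern in patterns:
            if pattern.search(text):
                return f"banned phrase: {phrase}"
        return None

    return gate


def run_gates(text, gates):
    """Run gates in order; return (gate name, reason) for the first failure, else None."""
    for name, gate in gates:
        reason = gate(text)
        if reason is not None:
            return (name, reason)
    return None


source = "Revenue grew 12% in 2023."
gates = [
    ("fact-anchor", make_fact_anchor_gate(source)),      # runs first
    ("brand-safety", make_brand_safety_gate(["leverage"])),
]
print(run_gates("Revenue grew 40% and we leverage AI.", gates))
```

Keeping the gates in an ordered list makes the sequence explicit and easy to audit.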

Frequently asked questions

How many banned words should the list contain?

Mature lists have 150-250 phrases after 6 months of refinement. A starter list is 50-80 phrases from the universal AI-tells library plus 20-30 industry-specific additions.

Will banning many words make output sound stilted?

Only if you ban without replacing. The right approach is to ban the phrase and have the regeneration prompt suggest a replacement direction. Stilted output comes from removing words without giving the model an alternative path.

Can the gate use a denylist API instead of regex?

Some implementations use hosted moderation APIs (for example, the OpenAI moderation endpoint or a vendor-provided safety classifier). These add latency and are built around safety categories, not brand style. Regex is faster and more controllable for brand-specific bans. Use APIs for safety-critical bans (hate speech, harmful content) and regex for brand-style bans.

Does the gate work for non-English content?

Yes, with separate banned-word lists per language. Each language has its own AI tells. A multilingual workspace needs multiple lists, one per language.

How aggressive should I be with the list during the ramp?

Very aggressive. Over-banning during the 14-day ramp is fine — the rejection rate is high anyway, and you are learning what triggers actual problems. Trim down once the Persona Brief stabilizes.
