When AI Guardrails Backfire

[Image: a pigeon standing in a corridor lined with crossed-out signs for goblins, gremlins, raccoons, trolls, ogres, and pigeons]

The things we add to AI systems to make them safer can make them worse.

I know. The standard advice runs in the opposite direction. Add guardrails. Tune the reward signals. Write system prompts that constrain the output. Build controls.

By “guardrails” I mean the control layer around the model: training-time reward signals, system prompts, caching logic, reasoning budgets, model routing, safety classifiers, A/B test buckets (product experiments that send different users through different configurations), and vendor wrappers. Some shape the model before deployment. Others sit between the user and the model at runtime. Together, they determine what the user actually experiences. Most users never see any of it. Most users do not know it exists.

But in April 2026, two of the largest AI companies published postmortems that tell the same story from different angles. OpenAI spent months chasing a bug that no evaluation caught: their models kept talking about goblins. Anthropic disclosed that three separate “improvements” to Claude Code degraded quality over a 47-day window, and that its internal evaluation and monitoring systems did not initially isolate the root causes.

These are not isolated curiosities. They expose a recurring failure mode in AI governance. And the people who bear the consequences of these failures, the users, cannot see the controls, cannot modify them, and have no idea when they change.

I have been writing about this pattern for a few months now. In “Don’t tell AI what time it is,” I showed that injecting timestamps into AI prompts for audit trail purposes drops accuracy by about 10%. The compliance mechanism undermines the output being audited. In “Is my LLM getting dumber, or is it just me?,” I documented a U-shaped accuracy curve across successive OpenAI models: 86% for the original GPT-4, 46.5% for GPT-4-turbo, then recovery to 95.3% with GPT-5.1. The model changed repeatedly. Nobody was told.

The first two posts were about prompts and model versions. This one is about the invisible control layer in between.

. . .

OpenAI’s goblin problem started with a feature called personality presets, launched alongside GPT-5.1 in November 2025. One of the eight presets was “Nerdy,” which used a system prompt telling the model to “undercut pretension through playful use of language” and acknowledge that “the world is complex and strange.”

That is a vibe, not an instruction. The model found its own way to satisfy it.

It started inserting references to goblins, gremlins, and other creatures into its metaphors. A user asking about database optimization might get an answer about “gremlins in your query.” A coding question might come back with a reference to “little goblins” hiding in the logic. One or two instances were harmless, even charming. But across model generations, the habit multiplied. The goblins kept showing up in contexts where nobody had asked for them.

During reinforcement learning from human feedback (RLHF), a reward signal designed to encourage the Nerdy personality scored outputs containing creature metaphors higher than outputs without them. The audit found this preference in 76.2% of the training datasets it examined. The Nerdy personality accounted for only 2.5% of ChatGPT responses, but 66.7% of all goblin mentions came from that 2.5%.
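To make that concentration concrete, here is a small back-of-the-envelope calculation using only the two percentages quoted above. The derived ratios are my own arithmetic, not figures from OpenAI, and the second one assumes both percentages are measured over the same pool of responses.

```python
# Figures quoted in the text; everything derived from them is illustrative arithmetic.
nerdy_share_of_responses = 0.025  # 2.5% of ChatGPT responses
nerdy_share_of_mentions = 0.667   # 66.7% of goblin mentions

# How over-represented Nerdy responses are among goblin mentions.
overrepresentation = nerdy_share_of_mentions / nerdy_share_of_responses

# Per-response mention rate under Nerdy, relative to all other responses.
other_share_of_responses = 1 - nerdy_share_of_responses
other_share_of_mentions = 1 - nerdy_share_of_mentions
rate_ratio = overrepresentation / (other_share_of_mentions / other_share_of_responses)

print(f"Over-representation: {overrepresentation:.1f}x")  # 26.7x
print(f"Per-response rate ratio: {rate_ratio:.0f}x")      # 78x
```

Under those assumptions, a Nerdy response was on the order of 78 times more likely to mention a goblin than a non-Nerdy one.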

Here is the part that matters for governance. The behavior transferred.

OpenAI’s own analysis shows that reinforcement learning did not keep the rewarded style tic confined to the Nerdy condition. The model learned “creature metaphors get higher reward scores” as a general pattern. As goblin mentions increased under the Nerdy prompt, they increased by nearly the same relative proportion in outputs generated without the Nerdy prompt. The contamination was proportional.

OpenAI retired the Nerdy personality in March 2026, removed the goblin-affine reward signal, and filtered creature-word training data. But GPT-5.5 had already begun training on data containing the contaminated outputs, so the goblins persisted. The fix for the Codex coding tool was blunt: a system prompt instruction saying “Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user’s query.”

Read that list again. Goblins. Gremlins. Raccoons. Trolls. Ogres. Pigeons. The fact that the ban had to extend to pigeons tells you how far the contamination spread from its original creature-metaphor seed.

That instruction was repeated multiple times in the same system prompt. The repetition suggests the engineers did not expect one prohibition to be enough.

They had reason for that instinct. A system prompt is not a database constraint. It is one instruction among many tokens the model must reconcile: the user’s message, prior conversation turns, tool outputs, retrieved documents, examples, and other product instructions. Long conversations create more chances for dilution, conflict, or override. Repetition can make a constraint more salient, but it does not turn the constraint into a hard rule.

A direct prohibition on a concrete noun like “goblin” is about as strong as negative instructions get. If that requires repetition to hold, softer instructions like “maintain professional tone” or “do not overstate environmental commitments” are far less reliable than compliance teams assume. The instruction you wrote into the system prompt and the instruction the model actually follows may not be the same thing, especially late in a long conversation.

. . .

The Anthropic postmortem tells a different version of the same story. Three changes shipped to Claude Code over 47 days, each affecting a different slice of users on a different schedule.

On March 4, the default reasoning effort for Claude Code was lowered from “high” to “medium.” This is a test-time compute tradeoff. In general, longer reasoning improves output quality, but it increases latency and cost. Reducing the default saved both. It also made the model noticeably less intelligent. The change was reverted on April 7, 34 days later.

On March 26, a caching optimization was introduced. The design was simple: if a coding session had been idle for more than an hour, clear old reasoning history to reduce the cost of resuming. The implementation used the clear_thinking_20251015 API header with keep:1. It had a bug. Instead of clearing old reasoning once on session resume, it cleared reasoning on every subsequent turn for the rest of the session. Claude kept executing tool calls and writing code, but it could no longer see why it had made earlier decisions. Users reported forgetfulness, repetition, and odd tool choices. The bug was fixed on April 10, 15 days later.
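The shape of that bug is easy to reproduce in miniature. The sketch below is not Anthropic's code: the `Session` class, the sticky flag, and my reading of keep:1 as "retain only the most recent reasoning entry" are all assumptions made for illustration. It shows how a prune that should fire once on resume can end up firing on every later turn.

```python
from dataclasses import dataclass, field

IDLE_LIMIT_SECONDS = 3600  # "idle for more than an hour"
KEEP = 1                   # assumption: retain only the most recent reasoning entry


@dataclass
class Session:
    """Hypothetical session model, for illustration only."""
    reasoning: list = field(default_factory=list)
    last_active: float = 0.0
    resumed_from_idle: bool = False  # used only by the buggy variant


def prune(s):
    s.reasoning = s.reasoning[-KEEP:]


def turn_intended(s, now, thought):
    # Prune once, only on the turn that resumes an idle session.
    if now - s.last_active > IDLE_LIMIT_SECONDS:
        prune(s)
    s.reasoning.append(thought)
    s.last_active = now


def turn_buggy(s, now, thought):
    # The idle check sets a flag that is never reset, so every later turn prunes too.
    if now - s.last_active > IDLE_LIMIT_SECONDS:
        s.resumed_from_idle = True
    if s.resumed_from_idle:
        prune(s)
    s.reasoning.append(thought)
    s.last_active = now


def run(turn_fn):
    s = Session(reasoning=["plan A", "plan B", "plan C"], last_active=0.0)
    for i, now in enumerate([4000.0, 4010.0, 4020.0], start=1):
        turn_fn(s, now, f"t{i}")
    return s.reasoning


print(run(turn_intended))  # ['plan C', 't1', 't2', 't3']
print(run(turn_buggy))     # ['t2', 't3']
```

In the buggy variant the session keeps working, but its memory of why it did anything never grows past two entries, which matches the forgetfulness users described.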

On April 16, a single line was added to the system prompt: “Length limits: keep text between tool calls to ≤25 words. Keep final responses to ≤100 words unless the task requires more detail.” This passed multiple weeks of internal testing with no regressions on the standard evaluation suite. After the incident, Anthropic ran a broader isolation test: remove that one line, compare model performance with and without it, and measure the difference. That test found a 3% intelligence drop for both Opus 4.6 and 4.7. The line was reverted on April 20, four days later.

Because each change hit different traffic on a different schedule, the aggregate effect looked like broad, inconsistent degradation. Users complained. Internal monitoring did not reproduce the issues. It took weeks for each problem to be traced to its root cause.
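A rough way to see why the picture looked inconsistent is to ask which changes were live on a given day. The sketch below uses only the dates given above and ignores the fact that each change reached a different slice of users, so treat it as a simplification rather than a reconstruction of the incident.

```python
from datetime import date

# (label, shipped, reverted or fixed), using the dates stated in the text.
CHANGES = [
    ("reasoning effort lowered", date(2026, 3, 4), date(2026, 4, 7)),
    ("cache-clearing bug", date(2026, 3, 26), date(2026, 4, 10)),
    ("length-limit prompt line", date(2026, 4, 16), date(2026, 4, 20)),
]


def active_on(day):
    """Labels of changes live on `day` (shipped <= day < reverted)."""
    return [label for label, start, end in CHANGES if start <= day < end]


first = min(start for _, start, _ in CHANGES)
last = max(end for _, _, end in CHANGES)
print("span in days:", (last - first).days)  # 47

for day in (date(2026, 3, 10), date(2026, 3, 30), date(2026, 4, 8),
            date(2026, 4, 12), date(2026, 4, 17)):
    print(day.isoformat(), active_on(day))
```

Depending on the day, a user could be hitting one problem, two at once, a different one, or none, which is hard to distinguish from noise in aggregate metrics.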

The verbosity prompt is worth pausing on because it closely mirrors my timestamp finding. In my experiment, irrelevant context (time of day) triggered conversational completions that broke exact-match evaluation pipelines. The model knew the answer but expressed it in a way that defeated automated extraction. In Anthropic’s case, a brevity constraint changed how the model allocated its token budget, and the standard evaluation suite did not measure the dimension that degraded. Both are cases where a prompt-level change altered the output distribution without touching the model weights.
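The exact-match failure is simple to show. The outputs below are invented for illustration, not taken from my experiment, and the tolerant scorer is just one possible extraction rule (take the last number in the response).

```python
import re


def exact_match(output: str, expected: str) -> bool:
    # Strict scorer: the whole response must equal the expected answer.
    return output.strip() == expected


def extract_then_match(output: str, expected: str) -> bool:
    # Tolerant scorer: compare the last number found in the response.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", output)
    return bool(numbers) and numbers[-1] == expected


terse = "42"
chatty = "Good morning! The answer is 42."  # hypothetical conversational completion

print(exact_match(terse, "42"), exact_match(chatty, "42"))  # True False
print(extract_then_match(chatty, "42"))                     # True
```

The model is equally correct in both responses. Only the strict scorer reports a regression, which is why a change in phrasing can look like a change in accuracy.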

. . .

My LLM-getting-dumber post showed that models change without notice. This section asks a different question: why don’t the evaluations designed to catch those changes actually work?

Standard AI evaluations are point-in-time benchmarks. They ask: does the model pass this test today? They do not ask: has the model’s behavior changed across real usage over weeks or months?

Chen, Zaharia, and Zou documented this gap in their 2023 Stanford/UC Berkeley study (later published in the Harvard Data Science Review). Their original July 2023 preprint reported that GPT-4’s accuracy on prime classification dropped from 97.6% to 2.4% over three months on identical prompts. Those numbers were revised in the published version to 84.0% and 51.1% after the authors moved from the original 500-question prime-number setup to a broader 1,000-question prime-versus-composite dataset. The directional finding survived: the same model, on the same task family, produced materially different results three months apart. The revision matters because it shows that even measuring drift requires careful methodology. I replicated a version of this finding in my own benchmark, where all GPT-4 family models scored only 38% on a similar chain-of-thought prime subset.

LLMEval-3, a longitudinal study published in 2025 covering nearly 50 models over 20 months, found that dynamic rankings diverge significantly from static benchmarks. The models that score well on a fixed test set are not always the models that perform well in production over time. Static benchmarks also suffer from data contamination: models can memorize test cases and “pass” without generalizing.

The Anthropic postmortem is the cleanest illustration of the gap. Their verbosity prompt passed weeks of testing on the standard suite. Only a broader post-complaint test that compared runs with and without the prompt line caught the 3% drop. The standard suite tested the wrong things.
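The isolation test described above is a paired ablation: run the same tasks with and without the one line and compare. Here is a minimal sketch of that idea. The scorer is a stub standing in for "run the model and grade the result," the prompt strings are placeholders, and the numbers are made up. None of it reflects Anthropic's actual harness or its 3% figure.

```python
from statistics import mean
from typing import Callable, Sequence


def ablation_delta(score: Callable[[str, str], float], base_prompt: str,
                   line: str, tasks: Sequence[str]) -> float:
    """Mean paired score difference: (with the line) minus (without it)."""
    with_line = base_prompt + "\n" + line
    return mean(score(with_line, t) - score(base_prompt, t) for t in tasks)


def stub_score(system_prompt: str, task: str) -> float:
    # Stub: pretend the length limit only hurts tasks tagged "complex".
    hurt = "Length limits" in system_prompt and task.startswith("complex")
    return 0.75 if hurt else 1.0


tasks = [
    "simple: rename variable",
    "complex: refactor module",
    "simple: fix typo",
    "complex: debug race condition",
]
delta = ablation_delta(
    stub_score,
    "You are a coding assistant.",                    # placeholder base prompt
    "Length limits: (placeholder for the line under test)",
    tasks,
)
print(delta)  # -0.125
```

The design point is the pairing. A suite made up only of the "simple" tasks here would report no regression at all, which is the kind of blind spot the broader test exposed.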

For compliance teams in financial services, audit, and other regulated settings, this means that validating an AI tool at deployment is not enough. The tool can degrade after deployment because the provider changed something upstream, and your validation results become stale the moment the provider ships an update. This is the same point I made in the LLM-getting-dumber post: the control “We use AI to review contracts” is incomplete without “and we monitor the AI for behavioral changes.”

. . .

In these cases, users were the leading signal. They saw the degradation before internal monitoring isolated the root cause.

In the Anthropic case, an AMD senior director published a 6,852-session, 234,760-tool-call audit on GitHub. She called Claude Code “unusable for complex engineering tasks.” Anthropic’s postmortem explicitly admits that internal evaluations and usage did not initially reproduce the issues.

In the OpenAI case, users publicly surfaced the goblin pattern before OpenAI traced it to the Nerdy reward signal.

In December 2023, GPT-4’s “laziness” was caught by users online before OpenAI acknowledged the pattern publicly. OpenAI’s response is worth remembering: “Only a subset of prompts may be degraded, and it may take a long time for customers and employees to notice and fix these patterns.”

That sentence should worry anyone using AI in a regulated workflow. If your firm uses an AI tool for credit review, fraud screening, transaction monitoring, tax classification, or audit evidence, and the provider ships a model update that shifts accuracy, who catches it? Not the provider, based on the evidence above. Not your compliance team, unless they are running continuous tests against production. Probably the front-line analyst who notices the output looks different one Tuesday morning and mentions it to a colleague. Not because of a procedure. Because of a feeling.

That is not a control. That is luck.

. . .

But the detection problem is only half of it. The deeper issue is that users have no control over the thing that changed.

Most hosted AI chat products and agent tools run with upstream system or developer instructions. This is the hidden instruction set that shapes how the model behaves before you type a single word. It defines the model’s persona, its constraints, its formatting defaults, its refusal thresholds. When OpenAI told its model to be “playfully nerdy,” that was a system prompt. When Anthropic added “keep text between tool calls to 25 words or fewer,” that was a system prompt change.

But system prompts are only the most visible hidden layer. Your request may also pass through model routing that sends it to different model variants based on task type, product policy, rate limits, or routing logic you do not control. Safety classifiers may block, filter, route, or alter the response pipeline before you see the final output. Product experiments (the A/B test buckets described earlier) may place you in different prompt or configuration buckets without your knowledge. The provider’s control stack is not one layer. It is several, and none of them are visible to you.

You may receive a general release note if the provider publishes one. Both Anthropic and OpenAI do publish some system prompt and model updates. But you usually do not receive a workflow-specific notice that says: this hidden instruction, routing rule, classifier, cache policy, or reasoning setting changed, and here is how it affects the use case you validated.

On the API, you have more control. You can write your own system prompt. But even there, the provider may maintain product-level instructions that interact with yours. Anthropic’s documentation confirms that system prompt updates for its web and mobile apps do not apply to the API, but for products like Claude Code, the provider’s own instructions sit in the context alongside whatever the user configures. The boundaries depend on the product, and they are not always visible to the user.

This is the part that should make compliance professionals uncomfortable. The system prompt is one of the most powerful determinants of model behavior short of changing the model or its runtime configuration. It shapes tone, formatting, verbosity, reasoning depth, refusal patterns, and the likelihood of specific outputs. When Anthropic’s verbosity prompt caused a 3% intelligence drop, that was one line. One line that users could not see, could not remove, and did not know had been added.

For enterprises using AI through a vendor layer like Microsoft Copilot, the opacity compounds. The vendor wraps the provider’s instructions in its own instructions. You are now three or four layers removed from the controls that govern your tool’s behavior: the provider’s hidden system prompt, the provider’s routing and classifiers, the vendor’s wrapper, and possibly the vendor’s own routing. If any layer changes, your output changes. You validated the tool in January. The provider changed their hidden prompt in March. The vendor changed their wrapper in April. Your January validation is meaningless, and you have no way to know that.

From an audit perspective, this strains a basic assumption. International Standard on Auditing (ISA) 315 already contemplates IT change risk. It tells auditors to understand the entity’s IT environment, including service providers, system changes, system failures, and IT-related risks. But that material was written for IT applications and general IT controls that can usually be inventoried, documented, and tested. It does not tell auditors what to do when a probabilistic AI tool changes because a provider modifies a hidden prompt, routing rule, reasoning budget, cache policy, or agent wrapper upstream. ISA 500, the audit-evidence standard, requires evidence to be sufficient and appropriate, but says nothing about what “sufficient” means when the tool producing that evidence has an invisible, mutable control layer. The standards do not need to be thrown out. They need practical guidance for a class of IT risk they were not designed to address.

In my timestamp experiment, I controlled the prompt entirely. I could measure the effect of adding one sentence because I knew exactly what the model received. In production, nobody has that visibility. The prompt you think the model received and the prompt it actually received may differ in ways that affect your output, and you have no mechanism to detect the difference.

. . .

No regulator has issued a binding rule addressing the upstream-update problem: a provider ships a change that alters model behavior in your workflow without advance notice.

The UK Financial Conduct Authority (FCA), the financial-services regulator, comes closest. In its long-term review speech on AI in financial services, it asked the right question: “What does ‘reasonable steps’ look like when the model you rely on updates weekly, incorporates components you don’t directly control, or behaves differently as soon as new data arrives?” The UK’s Critical Third Parties regime gives HM Treasury the power to designate AI providers, but no AI provider has been designated yet.

The EU AI Act creates documentation and lifecycle duties for general-purpose AI providers, but it is not well tailored to the product-layer change problem discussed here. A one-line prompt change or caching optimization may affect downstream behavior without looking like a new or significantly modified general-purpose AI (GPAI) model. The regulatory machinery was designed for model-level changes, not for the invisible prompt and configuration drift that caused both the goblin and Claude Code incidents.

The European Securities and Markets Authority (ESMA) has the strongest existing language. Its May 2024 statement applies AI governance expectations under MiFID II, the EU Markets in Financial Instruments Directive. It requires firms to track “any modifications made over time” and maintain testing and monitoring systems. But this applies to investment firms, not audit firms or financial-statement preparers.

The Committee of Sponsoring Organizations of the Treadway Commission (COSO), which many accountants know from internal-control frameworks, issued February 2026 guidance on internal control over generative AI. It is the most practically relevant document for accountants. It identifies model drift and frequent configuration changes as risks requiring monitoring activities and re-validation. But it is a framework, not a standard, and adoption is voluntary.

In Singapore, the Monetary Authority of Singapore (MAS) has Project MindForge and a 2025 consultation on AI Risk Management Guidelines. Both propose lifecycle monitoring expectations that contemplate generative AI specifically. The guidance targets banks and financial institutions, not the broader accounting profession.

But even if every regulator on this list closed their gaps tomorrow, the deeper problem would remain. Most regulated firms do not have the technical infrastructure to run longitudinal behavioral tests on upstream models. They would need machine-learning operations tooling, often called ML-ops, that monitors outputs over time, compares them against baselines, and flags drift before it reaches production workpapers. That capability does not exist at most firms today. The regulatory gap is real, but it is a symptom. The capability gap is the disease.
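The core of that monitoring capability is not exotic. As a minimal sketch, assuming a firm already scores a fixed test set against the production tool every day, drift detection can start as a trailing-window comparison against the validated baseline. The accuracy series, window, and tolerance below are invented for illustration.

```python
from statistics import mean


def flag_drift(baseline_acc, daily_acc, window=3, tolerance=0.05):
    """Indices of days where the trailing-window mean accuracy
    falls more than `tolerance` below the validated baseline."""
    flagged = []
    for i in range(window - 1, len(daily_acc)):
        recent = mean(daily_acc[i - window + 1 : i + 1])
        if baseline_acc - recent > tolerance:
            flagged.append(i)
    return flagged


# Hypothetical daily accuracy on a fixed regression set.
daily = [0.90, 0.91, 0.89, 0.90, 0.84, 0.80, 0.79]
print(flag_drift(0.90, daily))  # [5, 6]
```

The hard parts are everything around this function: a stable test set, a scoring method that survives changes in phrasing, and someone who owns the alert.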

. . .

I want to be careful about what I am not arguing. The answer is not to remove guardrails. Untuned models are not safer, more reliable, or more auditable. The point is narrower: a guardrail is itself a system change. It needs versioning, regression testing, disclosure, rollback criteria, and monitoring. A hidden control is still a dependency.
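One way to picture "a guardrail is itself a system change" is to treat the whole control layer as a versioned artifact. The sketch below fingerprints a configuration and reports what changed since validation. The field names and values are hypothetical, and it presumes the provider discloses the configuration at all, which is exactly what users usually lack.

```python
import hashlib
import json


def fingerprint(control_layer: dict) -> str:
    """Short, stable hash of a control-layer configuration."""
    canonical = json.dumps(control_layer, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def changed_keys(old: dict, new: dict) -> list:
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


# Hypothetical configuration captured when the tool was validated.
validated = {
    "system_prompt": "You are a coding assistant.",
    "reasoning_effort": "high",
    "cache_policy": "none",
}
# Hypothetical configuration observed later.
current = dict(validated, reasoning_effort="medium")

needs_revalidation = fingerprint(validated) != fingerprint(current)
print(needs_revalidation, changed_keys(validated, current))  # True ['reasoning_effort']
```

A mismatch does not say the tool got worse. It says the validation evidence on file describes a different system, which is the trigger for regression testing.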

Enterprise users can sometimes pin model versions, negotiate service-level agreements (SLAs), obtain System and Organization Controls (SOC) reports, and run canary evaluations, small tests before a full rollout. These help. But they do not solve hidden product-layer instructions, agent wrappers, model routing, caching changes, and reasoning-budget changes unless the vendor explicitly discloses and versions them. The gap is between what is contractually promised and what actually governs the model’s behavior on any given Tuesday.

. . .

Here is what ties the three posts together.

The timestamp post showed that a compliance input can degrade output quality. The LLM-getting-dumber post showed that the model behind the API changes without notice and nobody tells you. This post shows that the guardrails themselves, the reward signals, system prompts, caching optimizations, routing logic, and reasoning constraints that are supposed to make AI safer, can create the problems they were designed to prevent.

Every time you add a control to an AI system, you change the system. And you do not get to choose how.

The profession that invented control testing should be the first to say it clearly: a control you cannot inspect, version, or retest is not a control. It is a dependency. The profession needs to adopt that framing before the next model update ships.

. . .

For related findings on prompt sensitivity and model drift, see: “Don’t tell AI what time it is” and “Is my LLM getting dumber, or is it just me?”

Sources cited: OpenAI, “Where the goblins came from” (April 2026). Anthropic, “An update on recent Claude Code quality reports” (April 23, 2026). Chen, Zaharia, and Zou, “How Is ChatGPT’s Behavior Changing over Time?” (Harvard Data Science Review, 2024). COSO, “Achieving Effective Internal Control Over Generative AI” (February 2026). LLMEval-3, “A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models” (arXiv, 2025). AMD Claude Code audit, “Claude Code is unusable for complex engineering tasks with the Feb updates” (GitHub, April 2026). FCA, “The FCA’s long term review into AI and retail financial services: designing for the unknown” (2025). EU GPAI Guidelines, “Guidelines for providers of general-purpose AI models” (2025). ESMA, “Public Statement on AI and investment services” (May 2024). ISA 315 (Revised 2019), ISA 500.
