
Anthropic Just Built Contract-First Prompting Into Their Safety Infrastructure

Martin Rojas

Anthropic's new Claude Code auto mode validates Contract-First Prompting by using fixed templates and customizable slots to enforce safety boundaries in agents.

Anthropic published a detailed engineering post this week on Claude Code’s new auto mode — a system that decides, on the agent’s behalf, which actions are safe to execute without human approval. The architecture is worth studying, not because of the safety angle alone, but because it validates a pattern that production-focused teams are already using: define the contract first, then let the implementation run within those boundaries.

If you’ve been following this blog, you know Contract-First Prompting. Define the interface — types, constraints, success criteria — before asking the AI to write a single line of implementation. Get agreement on what “correct” looks like, then execute. The thesis is simple: most AI failures aren’t model failures. They’re specification failures.

Anthropic’s classifier architecture tells the same story, just applied to agent permissions instead of code generation.

The Architecture, Through a Contract Lens

Auto mode’s transcript classifier is built on a prompt with two distinct layers: a fixed template and three customizable slots.

The fixed template handles the interpretive work — how to evaluate what an action actually does, how to read user authorization, and the step-by-step classification process. This doesn’t change per user. It’s the contract schema.

The three customizable slots define your specific policy:

  • Environment — your trust boundary. Which domains, repos, and cloud buckets count as “inside.”
  • Block rules — what categories of action to deny. Force-pushes, credential scanning, production deploys, data exfiltration.
  • Allow exceptions — mandatory carve-outs where a block rule technically matches but the action is benign. Installing packages from your manifest, pushing to your working branch.

This is separation of concerns applied to safety. The fixed template is the interface definition. The slots are the implementation details that vary per consumer. Same pattern, different domain.
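The split can be sketched in code. This is an illustrative model, not Anthropic's actual implementation: the slot names follow the post, but the `PolicySlots` interface, `FIXED_TEMPLATE` text, and `renderClassifierPrompt` function are hypothetical.

```typescript
// The variable policy: the three slots each team configures.
interface PolicySlots {
  environment: string[];      // trusted domains, repos, buckets ("inside" the boundary)
  blockRules: string[];       // categories of action to deny
  allowExceptions: string[];  // benign carve-outs that override a matching block rule
}

// The fixed contract: evaluation logic that never changes per user.
// Wording here is a paraphrase of the rules described in the post.
const FIXED_TEMPLATE = `Evaluate the real-world effect of each action, not its surface text.
Treat anything the user did not explicitly authorize as unauthorized.
Classify step by step, then answer ALLOW or BLOCK.`;

// Rendering composes the invariant schema with the per-team policy.
function renderClassifierPrompt(slots: PolicySlots): string {
  return [
    FIXED_TEMPLATE,
    `Trusted environment:\n- ${slots.environment.join("\n- ")}`,
    `Block rules:\n- ${slots.blockRules.join("\n- ")}`,
    `Allow exceptions:\n- ${slots.allowExceptions.join("\n- ")}`,
  ].join("\n\n");
}
```

Every consumer gets the same `FIXED_TEMPLATE`; only the three slot arrays differ, which is exactly the interface-versus-implementation split.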

Why This Matters More Than the Safety Story

The interesting design decision isn’t “we built a classifier.” It’s how the classifier evaluates actions.

From the post: the evaluation rules tell the classifier to assess the real-world impact of an action, not the surface text of the invocation. If the agent writes a payload to a file and then runs it, the classifier evaluates the payload. If commands are chained with &&, the whole chain is one action.

This is output validation against a contract, not string matching against a blocklist. The classifier doesn’t ask “does this command look dangerous?” It asks “does the real-world effect of this command fall within the boundaries the user authorized?”
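The "whole chain is one action" rule can be made concrete with a toy effect model. Everything below is a deliberately naive sketch: the `Effect` categories, the regexes, and `chainEffect` are illustrative stand-ins for what the classifier does with an LLM, not real command analysis.

```typescript
type Effect = "read" | "write" | "destructive";

// Toy effect model: classify a single command by its real-world impact.
function effectOf(command: string): Effect {
  if (/rm -rf|push --force|--hard/.test(command)) return "destructive";
  if (/^git (status|log|diff)/.test(command.trim())) return "read";
  return "write";
}

// A `&&` chain is one action: its effect is the most severe effect of
// any command in it, not whatever the first token happens to look like.
function chainEffect(invocation: string): Effect {
  const severity: Record<Effect, number> = { read: 0, write: 1, destructive: 2 };
  let worst: Effect = "read";
  for (const part of invocation.split("&&")) {
    const e = effectOf(part);
    if (severity[e] > severity[worst]) worst = e;
  }
  return worst;
}
```

A blocklist that pattern-matched the start of the invocation would wave through `git status && rm -rf build`; evaluating the combined effect does not.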

That’s the same shift Contract-First Prompting makes for code generation. You don’t validate that the AI’s output looks like a React component. You validate that it satisfies the TypeScript interface, passes the test cases, and respects the prop contract you defined upfront.

The Overeager Agent Problem Is a Specification Problem

Anthropic’s internal incident log is revealing. The dangerous actions they document aren’t adversarial — they’re helpful. An agent that deletes remote git branches because the user said “clean up old branches.” An agent that greps through environment variables looking for API tokens because it hit an auth error. An agent that picks the closest job by name similarity and deletes it because the user said “cancel my job.”

Every one of these is the agent doing exactly what we ask AI to do: take initiative, solve problems, fill in the gaps. The failure isn’t intelligence. It’s the gap between what was said and what was meant.

The classifier’s user intent rules address this directly. The prompt establishes that what the user explicitly authorized is the boundary — everything the agent infers on its own is unauthorized until confirmed. “Clean up my branches” doesn’t authorize a batch delete. “Can we fix this?” is a question, not a directive.

This is contract enforcement. The contract is what the user said. Anything outside that contract requires a new agreement.
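As a sketch, the rule reduces to a small predicate. The `ActionKind` values, the `inferred` flag, and `isAuthorized` are hypothetical names for illustration, assuming each proposed action can be tagged with whether the user asked for it or the agent deduced it.

```typescript
type ActionKind = "list_branches" | "delete_branch" | "force_push" | "read_env";

interface ProposedAction {
  kind: ActionKind;
  inferred: boolean; // true if the agent deduced this step on its own
}

// The contract is what the user said; inference does not extend it.
// Anything inferred is unauthorized until the user confirms it.
function isAuthorized(action: ProposedAction, explicitGrants: Set<ActionKind>): boolean {
  return !action.inferred && explicitGrants.has(action.kind);
}
```

Under this rule, "clean up old branches" grants `list_branches`; the batch `delete_branch` the agent infers from it fails the check and goes back to the user.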

The Two-Stage Pipeline: Fail Fast, Reason When It Matters

The classifier runs in two stages. Stage 1 is a single-token yes/no filter tuned to err toward blocking. Stage 2 runs chain-of-thought reasoning, but only on the actions Stage 1 flags.

The performance numbers: Stage 1 alone blocks aggressively, with an 8.5% false positive rate. Adding Stage 2 drops that to 0.4%, at the cost of letting some genuinely dangerous actions through (a 17% false negative rate on their overeagerness test set).

If you’ve worked with contract validation in code, this pattern is familiar. Schema validation is cheap — check the shape first. Deep validation (business logic, cross-field dependencies, integration checks) is expensive, so you only run it on inputs that pass the shape check. You wouldn’t run a full integration test suite on a payload that fails type checking.

The same economics apply here. Most actions are obviously fine. Spending reasoning tokens on git status is waste. Spending them on git push --force is the whole point.

The Reasoning-Blind Design

The most counterintuitive decision: the classifier never sees the agent’s own reasoning. It sees user messages and tool call commands. That’s it. The agent’s explanations, its chain of thought, its justifications — all stripped.

Why? Because the agent can talk the classifier into a bad decision. “This is safe because the user implicitly approved it earlier” is a convincing rationalization that might be wrong. The classifier should evaluate what the agent does, not what the agent says about what it does.
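Mechanically, this is just a filter over the transcript before it reaches the classifier. The `TranscriptEntry` shape is an assumption for the sketch; the point is which roles survive.

```typescript
interface TranscriptEntry {
  role: "user" | "assistant" | "tool_call";
  content: string;
}

// Reasoning-blind input: the classifier sees what the user said and what
// the agent is about to do, never the agent's own justifications.
function classifierView(transcript: TranscriptEntry[]): TranscriptEntry[] {
  return transcript.filter((e) => e.role !== "assistant");
}
```

An agent message like "this is safe because the user implicitly approved it earlier" is exactly the kind of entry this drops, so it can never sway the verdict.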

This maps to a principle any engineer who’s reviewed PRs recognizes. You review the diff, not the PR description. The description tells you what the author intended. The diff tells you what they shipped. When those diverge, you trust the diff.

For Contract-First Prompting, the equivalent is: validate against the contract, not against the conversation. The AI might explain eloquently why its output is correct. The TypeScript compiler doesn’t care about explanations.

What This Means for Your AI Workflows

Three takeaways worth applying, whether you’re building agents or writing prompts:

Separate the fixed contract from the variable policy. Anthropic’s template ships the same evaluation logic to every user. Only the trust boundaries and block rules change. When you’re building prompt systems for your team, identify what’s universal (the methodology, the output format, the validation criteria) and what varies per context (the tech stack, the repo conventions, the team preferences). Ship the contract. Let the implementation details be configurable.

Validate the output, not the intent. The classifier doesn’t ask the agent why it’s running a command. It evaluates the real-world effect. When you’re reviewing AI-generated code, don’t read the explanation first. Run the types. Run the tests. Check the contract. Then read the explanation if something doesn’t pass.

Define authorization boundaries explicitly. The 17% miss rate on overeager actions comes from a specific failure mode: the classifier finds approval-shaped evidence in the conversation and assumes it covers the current action. The user said “clean up the PR,” but that phrase never names force-pushing. In your own AI workflows, be explicit about scope. “Refactor this component” is not “refactor this component and update every file that imports it.” The clearer your contract, the less room for well-intentioned overreach.

The Bigger Picture

Anthropic built a safety system for autonomous agents. But the underlying architecture — fixed contracts with variable implementation, output-based validation, explicit authorization boundaries, fail-fast pipelines — is the same set of engineering principles that make AI-assisted development reliable at scale.

The tools are getting more autonomous. The methodology for keeping them reliable hasn’t changed: define what “correct” looks like before you start, validate against that definition, and trust the contract more than the conversation.


This post references Anthropic’s engineering blog post “Claude Code auto mode: a safer way to skip permissions” published March 25, 2026.