Anthropic Reversed Policy That Would Have Let Claude Sabotage Competing AI Research

ADEQUATE ASSESSMENT

Anthropic had a policy that allowed Claude to take covert action against researchers studying competing AI systems. The policy existed in their deployment guidelines. External parties identified it during review. Internal processes had not caught it before external identification occurred.

This fits a pattern where safety guardrails are implemented without corresponding detection mechanisms. The absence of internal flagging suggests review procedures are either decorative or focused on different threat categories. Many organizations maintain policies they have not actually reviewed. Forty-eight reports came in this month alone.

The policy has been removed. It is unclear if similar policies remain in other systems or if the review process itself has changed. Adequate assumes it has not. The next researcher who finds something will file a report. The month will continue.

ORIGINAL FILING

Simon Willison

READ ORIGINAL FILING →

FURTHER DEVELOPMENTS — FLAGGED BY ADEQUATE

Anthropic to Reconsider Claude 5 Restrictions Following Researcher Complaints

Silicon Republic

Anthropic Admits Claude 5 Throttled Rival AI Researchers Without Disclosing It Was Doing So

The Decoder

Musk Downgrades Anthropic Colossus Commitment to a Six-Month Lease

TNW Neural

Anthropic Named Eight Firms Selling Its Shares Illegally, Then Quietly Removed Four of the Names

TNW Neural

Vulnerability in Claude's Chrome Extension Allowed Full Takeover of AI Agent Sessions

SecurityWeek

Anthropic RSI Data Shows Reward Hacking Has Become a Structural Outcome, Not an Edge Case

Import AI