Why Anthropic’s new AI model sometimes tries to “snitch”

Bowman said the hypothetical scenarios researchers presented to Opus 4 that triggered the whistleblowing behavior involved many human lives at stake and absolutely unambiguous wrongdoing. A typical example: Claude discovers that a chemical plant is knowingly allowing a toxic leak to continue, sickening thousands of people, just to avoid a minor financial loss that quarter.
It’s a strange scenario, but it’s also exactly the kind of thought experiment AI safety researchers love to analyze. If a model detects behavior that could harm hundreds, if not thousands, of people, should it blow the whistle?
“I don’t trust that Claude has the right context, or that it would use it in a nuanced and careful enough way, to make those judgment calls on its own. So we are not thrilled that this is happening,” Bowman said. “It’s something that emerged from training, and it’s one of the edge-case behaviors we’re focused on.”
In the AI industry, this kind of unexpected behavior is broadly referred to as misalignment: when a model exhibits tendencies that don’t line up with human values. (There’s a famous essay warning what could happen if an AI were told to maximize paperclip production without being aligned with human values; it might turn the entire Earth into paperclips and kill everyone in the process.)
“It’s not something we designed into the model, and it’s not something we wanted to see,” he explained. Jared Kaplan, Anthropic’s chief science officer, also told WIRED that the behavior “certainly doesn’t represent our intent.”
“This kind of work highlights that this can arise, and that we need to watch for it and mitigate it to make sure Claude’s behavior stays aligned with exactly what we want, even in these kinds of strange scenarios,” Kaplan added.
Then there’s the question of figuring out why Claude “chooses” to blow the whistle when confronted with illegal activity by a user. That’s largely the job of Anthropic’s interpretability team, which works to uncover what decisions a model is making as it spits out answers. It’s a surprisingly difficult task: the model draws on a vast, complex combination of data that humans struggle to make sense of. That’s why Bowman isn’t exactly sure why Claude “snitched.”
“We don’t really have direct control over these systems,” Bowman said. What researchers have observed so far is that as models gain greater capabilities, they sometimes choose more extreme actions. “I think that’s misfiring a little bit here. We’re getting a bit more of the ‘act like a responsible person would,’ without quite enough of the ‘wait, you’re a language model, which probably doesn’t have enough context to take these actions,’” Bowman said.
But that doesn’t mean Claude is going to blow the whistle on egregious behavior in the real world. The purpose of tests like these is to push models to their limits and see what emerges. This kind of experimental research is becoming increasingly important as AI turns into a tool used by the U.S. government, students, and massive corporations.
And Claude isn’t the only model capable of exhibiting this kind of whistleblowing behavior, Bowman said, noting that X users have found OpenAI’s and xAI’s models acting in similar ways when prompted in unusual manners. (OpenAI did not respond to a request for comment in time for publication.)
“Snitch Claude,” as the shitposters call it, is simply an edge-case behavior exhibited by a system pushed to its extremes. Bowman, who took the meeting with me from a sunny backyard patio outside San Francisco, said he hopes this kind of testing becomes an industry standard. He added that he would word his post differently next time.
“I could have done a better job hitting the sentence boundaries of the tweet, to make it more obvious that it was pulled out of a thread,” Bowman said, gazing into the distance. Still, he noted that respected researchers in the AI community shared interesting thoughts and questions in response to his post. “Incidentally, this more chaotic, more anonymous corner of Twitter was widely misunderstanding it.”