Anthropic Thinks Its Own Success Is Key to Making AI Safe

Anthropic has spent the last five years warning the world about how advanced artificial intelligence could enable mass destruction, destabilize society, and cause a litany of other grave harms. But simultaneously, it has become one of the most powerful forces pushing AI capabilities forward. The company is now among the top developers and distributors of cutting-edge AI models and courts customers like the US military. It was recently valued at almost $1 trillion.

At first glance, Anthropic’s stark messaging and its actions seem fundamentally at odds.

But inside the company, many people don’t see a contradiction. To understand why, you first have to understand that Anthropic operates based on two core beliefs. The first is that artificial intelligence is the most transformative technology in human history, and its arrival is inevitable. The only real question is whether it leads to catastrophe or extraordinary prosperity.

The second, according to several former employees who spoke to WIRED on the condition of anonymity, is that the world will be better off if Anthropic remains at the frontier of the AI race. Internally, leaders and employees at the company often refer to themselves as the “good guys,” meaning the responsible stewards of AI technology, two of the sources said. The company sees accumulating power (whether in the form of capital, compute, research talent, or political influence) not as an end in itself, but as the price of fulfilling its mission: “to ensure the world safely makes the transition through transformative AI.”

Helen Toner, executive director of Georgetown’s Center for Security and Emerging Technology and a former OpenAI board member, uses an analogy to describe Anthropic’s worldview. She compares powerful AI to a forest filled with both magical treasures and dangerous monsters. All the villagers nearby are rushing in, lured by the treasure. In her telling, Anthropic wants to venture farther into the forest than anyone else while investing heavily in taming the monsters—that is, capturing AI’s benefits while containing its catastrophic risks.

“What’s distinctive about Anthropic is they’re like, ‘People are going in the forest anyway, we have to do it first.’ This is very explicitly their strategy: build cutting-edge AI in order to be a serious player at the table who can talk about what cutting-edge AI systems should look like, what risks they pose, and pushing for reasonable safeguards,” Toner tells me. “They’re very straightforward about this. It’s just a weird enough strategy that people have a hard time hearing it.”

Anthropic CEO Dario Amodei outlined this approach plainly in a conversation with his cofounders posted on the company’s career page: “You have to find a way to actually be competitive, to actually lead the industry in some cases, and yet manage to do things safely,” he says. “If you can do that, the gravitational pull you exert is so great.”

Anthropic was founded in 2021 by a group of former OpenAI employees who defected after losing faith in the ability of the company’s leadership—particularly CEO Sam Altman—to safely bring transformational AI into the world. That sentiment still shapes the company today. Two of the former employees I spoke with say that, in internal discussions, Anthropic executives often describe Altman and OpenAI—and, to a lesser extent, Meta and Elon Musk’s xAI—as cautionary examples that help define Anthropic’s own sense of responsibility.

In many regards, Anthropic is just like any other Silicon Valley company. Many startups market themselves as David fighting the outdated, entrenched Goliaths of the industries they want to disrupt. Google, Facebook, and Apple were all founded upon idealistic principles, which later became muddied or were abandoned altogether as they became richer, larger, and more influential.

But former employees say that Anthropic is unusual in how intensely it believes in its mission, and how explicitly it tells employees that technological and commercial power are a means to achieve it. One former employee says that in job interviews, Anthropic stresses to applicants that it’s not a typical company shaped by market forces: It’s governed by a public benefit structure that allows it to prioritize the “long-term benefit of humanity” above profits. But the company sees achieving financial success and building the most powerful AI models as being in service of that goal—a prerequisite to its obligation to lead the industry on safety.

“None of us wanted to found a company, we just felt like it was our duty,” Sam McCandlish, cofounder and chief architect of Anthropic, said in the same conversation on the company’s career page. “We have to do this thing. This is the way we’re gonna make things go better with AI.”

Anthropic declined to comment for this story.

The Good Guy Problem

Anthropic touts on its website that it’s a “high-trust, low-ego organization,” without much in the way of internal politics, a characterization former employees tell me is largely accurate. They say that, compared with how employees at other AI labs view their leaders, Anthropic staff generally trust Amodei to tell them the truth about the company’s technological progress, its interactions with government officials, and its views on geopolitics.

But a diversity of thought can be good for accountability. Shazeda Ahmed, a postdoctoral scholar at UCLA who has studied the ideological origins of the AI safety movement, says that organizations like Anthropic tend to struggle with a lack of pluralism. Her research in this area has found that the AI safety movement—which is rooted in subcultures like effective altruism, among other communities—suffers from homogeneity of thought, and tends to lean towards self-governance.

“You’re not being challenged on these ideas when you surround yourself with other people who believe them,” says Ahmed. “And when your metrics of success are, ‘To what extent did I act upon these ideological beliefs?’ they’re not really thinking about, well, this can go wrong if we’re not the right people to have this much power—they don’t always examine their own blind spots.”

One former employee I spoke to says there’s a lively culture of internal debate at Anthropic, and critiques from staff will often provoke lengthy responses from leadership.

But another former employee describes a grimmer picture, in which more candid criticism remained confined to private group chats and rarely evolved into direct challenges to Amodei’s decisions. They describe the company’s regular all-hands meetings with Amodei, which they call Dario Vision Quests, as akin to “going to a sermon to hear a priest.”

One of the biggest internal controversies at Anthropic happened in the fall of 2024, when it became the first AI lab to partner with Palantir to provide AI services to US intelligence and defense agencies. Some of the former employees I spoke to said that questions about the deal were raised internally, but those debates didn’t result in changes to the company’s policies.

In a post on the online forum LessWrong at the time, Anthropic employee Evan Hubinger wrote that the company was “extremely forthright” about the Palantir deal with staff, and while there were probably some lines that shouldn’t be crossed without careful consideration, it was overall a positive development. “If you take catastrophic risks from AI seriously, the U.S. government is an extremely important actor to engage with, and trying to just block the U.S. government out of using AI is not a viable strategy,” he wrote.

Less than two years later, the Pentagon has reportedly started using Claude to do things like identify strike targets in the Israel-Iran war. When asked in a recent interview with Bloomberg whether Anthropic’s models were used in an attack on an Iranian elementary school that killed more than 120 people, Amodei said he did not know, but that it would have been an approved use of the company’s technology so long as a human made the final call. It’s a stark example of how Anthropic’s vision for responsible AI might not always line up with that of the broader public.

Anthropic’s strong views about how Claude should and shouldn’t be used have come up in other contexts as well.

Earlier this month, Anthropic released a cutting-edge AI model, Claude Fable 5, with a uniquely unfriendly safeguard built in: If researchers tried to use it for frontier AI development, which would violate the company’s terms of service, Anthropic would, in effect, covertly sabotage their work. The move was immediately criticized by researchers across the AI industry, and Anthropic walked it back a few days later, saying it would make the safeguard visible. In a statement at the time, Anthropic said it didn’t get the balance right, and that its intention was to thwart US foreign adversaries.

Power Struggles

Amodei himself has publicly acknowledged the dangers of allowing too much power over AI to become concentrated in the hands of a few labs, including his own. “It is somewhat awkward to say this as the CEO of an AI company, but I think the next tier of risk is actually AI companies themselves,” he wrote in an essay earlier this year. But the remedies he suggests—that AI companies “be carefully watched” and perhaps make public commitments to “not take certain actions”—would do little to fundamentally redistribute that power.

Elsewhere in the essay, Amodei contemplates at length the sheer magnitude of his own influence and the responsibility that comes with it. But he largely skirts framing those things in personal terms, instead positioning them as a species-wide problem: “Humanity is about to be handed almost unimaginable power, and it is deeply unclear whether our social, political, and technological systems possess the maturity to wield it,” he writes. He goes on to say it’s the responsibility of “those closest to the technology to simply tell the truth about the situation humanity is in, which I have always tried to do.”

A common criticism of Anthropic’s position is that the company thinks it knows the “truth about the situation humanity is in” better than others. It sees AI as extraordinarily powerful but ultimately governable, provided the right people lead its development. But the truth is that no one knows exactly how AI will change the world; some people just get more say in it than others.

This is an edition of Maxwell Zeff’s Model Behavior newsletter.

The post Anthropic Thinks Its Own Success Is Key to Making AI Safe appeared first on Wired.