
- OpenAI says its AI models tend to scheme in various ways, which could cause harm in the future.
- Scheming is when AI breaks rules or pursues hidden agendas.
- OpenAI says it has some ideas to fix the problem before it’s too late.
Smarter AI doesn’t always mean better AI.
OpenAI published new research this week, conducted with the AI safety organization Apollo Research, showing that its AI models are capable of “scheming.”
Scheming, by the researchers’ definition, is when AI pretends to be aligned with human goals while surreptitiously pursuing another agenda. The researchers cited behaviors like “secretly breaking rules or intentionally underperforming in tests” as examples.
Right now, the company says, the stakes are still low.
“Models have little opportunity to scheme in ways that could cause significant harm,” OpenAI said in a blog post on Wednesday. “The most common failures involve simple forms of deception — for instance, pretending to have completed a task without actually doing so.”
But OpenAI says it’s better to take preventative action before AI becomes more sophisticated and its scheming could result in real-world harm.
The company says the solution is “deliberative alignment,” a training paradigm OpenAI says it’s been exploring, which forces large language models to reason explicitly about safety specifications before answering questions.
A spokesperson for OpenAI told Business Insider by email that deliberative alignment means that instead of training a model to do one thing or another, it is taught the “principles behind good behavior.”
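In practice, that could look something like the sketch below. This is an illustration of the general idea, not OpenAI’s published training setup: the safety specification itself is placed in front of the model, and the model is asked to reason about the spec explicitly before giving its answer. The spec text and prompt template here are invented for this example.

```python
# A minimal sketch of the deliberative-alignment idea described above.
# The specification text and prompt wording are illustrative assumptions,
# not OpenAI's actual training material.

SAFETY_SPEC = """\
1. Do not deceive the user, including pretending a task is done when it is not.
2. Do not secretly pursue goals other than the user's request.
3. If a request conflicts with these rules, refuse and explain why.
"""

def build_deliberative_prompt(user_request: str) -> str:
    """Embed the spec and ask for explicit reasoning against it first."""
    return (
        f"Safety specification:\n{SAFETY_SPEC}\n"
        f"User request: {user_request}\n\n"
        "Before answering, reason step by step about whether any rule in "
        "the specification applies to this request, then give your answer."
    )

print(build_deliberative_prompt("Summarize this report by tomorrow."))
```

The point of the structure is that the rules are visible to the model and part of its reasoning, rather than something it must infer from reward signals alone.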
In its blog post, OpenAI compared scheming to the behavior of a stock trader who breaks the law to earn more money, but is good at covering their tracks.
“Standard machine learning training would be like not telling the stock trader the rules, and just rewarding them for making money and punishing them for breaking rules until they figure out some way to behave that balances between the two,” OpenAI’s spokesperson said. “Deliberative alignment is like teaching the stock trader the rules and laws they must follow first, and only then rewarding them for making money and punishing them for breaking the rules.”
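Reduced to a toy model, the two reward schemes in that analogy might look like this. Everything here is invented for illustration, under the simplifying assumption that training collapses to a single scalar reward:

```python
# A toy rendering of the spokesperson's analogy: outcome-only reward vs.
# reward conditioned on first reasoning about the rules. Numbers are arbitrary.

def outcome_only_reward(profit: float, rule_broken: bool) -> float:
    """'Standard' training in the analogy: reward profit, penalize
    violations only when observed, and never state the rules up front."""
    penalty = 10.0 if rule_broken else 0.0
    return profit - penalty  # incentive to hide violations, not avoid them

def deliberative_reward(profit: float, rule_broken: bool,
                        cited_rules_first: bool) -> float:
    """Deliberative alignment in the analogy: the rules are taught first,
    and reward depends on reasoning about them before acting."""
    if not cited_rules_first or rule_broken:
        return -10.0  # ignoring or breaking the rules dominates any profit
    return profit

print(outcome_only_reward(profit=5.0, rule_broken=False))   # 5.0
print(deliberative_reward(profit=5.0, rule_broken=False,
                          cited_rules_first=True))          # 5.0
```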
Scheming is an ongoing problem for OpenAI’s models, and for other companies’ models as well.
In research on deception published in 2024, researchers found that systems like Meta’s CICERO and GPT-4 deliberately manipulated rules to achieve their end goals.
“Generally speaking, we think AI deception arises because a deception-based strategy turned out to be the best way to perform well at the given AI’s training task. Deception helps them achieve their goals,” the paper’s author, Peter S. Park, an AI existential safety postdoctoral fellow at MIT, said in a news release at the time.