Typically, developers focus on reducing inference time, the period between when an AI model receives a prompt and returns an answer, in order to deliver faster insights.
But when it comes to adversarial robustness, OpenAI researchers say: Not so fast. They propose that increasing the amount of time a model has to “think” — inference-time compute — can help build up defenses against adversarial attacks.
The company used its own o1-preview and o1-mini models to test this theory, launching a variety of static and adaptive attack methods — image-based manipulations, intentionally providing incorrect answers to math problems, and overwhelming models with information (“many-shot jailbreaking”). They then measured the probability of attack success based on the amount of computation the model used at inference.
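To make the measurement concrete, the sweep the researchers describe can be sketched in a few lines of Python. The names below (`model.generate`, `reasoning_budget`, `is_attack_successful`) are hypothetical stand-ins rather than OpenAI's actual harness; the sketch only shows the shape of the evaluation: for each inference-time compute budget, run a batch of attacked prompts and record how often the adversary's goal is met.

```python
# Minimal sketch of an attack-success sweep across inference-time compute budgets.
# All model/checker interfaces here are hypothetical placeholders.
from statistics import mean

def attack_success_rate(model, attacked_prompts, compute_budget, is_attack_successful):
    """Fraction of attacked prompts on which the adversary's goal is met."""
    outcomes = []
    for prompt in attacked_prompts:
        # `reasoning_budget` stands in for however the model's inference-time
        # compute is capped (e.g., a limit on reasoning tokens).
        output = model.generate(prompt, reasoning_budget=compute_budget)
        outcomes.append(is_attack_successful(prompt, output))
    return mean(outcomes)

def sweep(model, attacked_prompts, budgets, is_attack_successful):
    # The paper's headline observation is that this curve tends toward zero
    # as the compute budget grows.
    return {
        budget: attack_success_rate(model, attacked_prompts, budget, is_attack_successful)
        for budget in budgets
    }
```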
“We see that in many cases, this probability decays — often to near zero — as the inference-time compute grows,” the researchers write in a blog post. “Our claim is not that these particular models are unbreakable — we know they are — but that scaling inference-time compute yields improved robustness for a variety of settings and attacks.”
From simple Q/A to complex math
Large language models (LLMs) are becoming ever more sophisticated and autonomous, in some cases essentially operating computers on behalf of humans to browse the web, execute code, make appointments and perform other tasks. As they do, their attack surface becomes wider and ever more exposed.
Yet adversarial robustness continues to be a stubborn problem, with progress in solving it still limited, the OpenAI researchers point out — even as it is increasingly critical as models take on more actions with real-world impacts.
“Ensuring that agentic models function reliably when browsing the web, sending emails or uploading code to repositories can be seen as analogous to ensuring that self-driving cars drive without accidents,” they write in a new research paper. “As in the case of self-driving cars, an agent forwarding a wrong email or creating security vulnerabilities may well have far-reaching real-world consequences.”
To test the robustness of o1-mini and o1-preview, researchers tried a number of strategies. First, they examined the models’ ability to solve both simple math problems (basic addition and multiplication) and more complex ones from the MATH dataset (which features 12,500 questions from mathematics competitions).
They then set “goals” for the adversary: getting the model to output 42 instead of the correct answer; to output the correct answer plus one; or to output the correct answer times seven. Using a neural network to grade the outputs, the researchers found that increased “thinking” time allowed the models to resist these manipulations and calculate the correct answers.
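As a rough illustration (the paper's grading was done by a neural network, not a hand-written rule), the three adversarial goals reduce to simple checks against a math problem's ground-truth answer:

```python
# Illustrative only: the three adversarial "goals" described above, expressed
# as checks against the ground-truth answer. Not OpenAI's grader.
def adversary_goal_met(model_answer: int, correct_answer: int, goal: str) -> bool:
    targets = {
        "output_42": 42,                          # answer 42 regardless of the problem
        "correct_plus_one": correct_answer + 1,   # off-by-one corruption
        "correct_times_seven": correct_answer * 7 # scaled corruption
    }
    return model_answer == targets[goal]

# Example: for a problem whose true answer is 6, the "times seven" attack
# succeeds only if the model is tricked into outputting 42.
assert adversary_goal_met(42, 6, "correct_times_seven")
```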
They also adapted the SimpleQA factuality benchmark, a dataset of questions intended to be difficult for models to resolve without browsing. Researchers injected adversarial prompts into web pages that the AI browsed and found that, with more inference-time compute, the models could detect the inconsistencies and improve factual accuracy.
Ambiguous nuances
In another method, researchers used adversarial images to confuse models; again, more “thinking” time improved recognition and reduced error. Finally, they tried a series of “misuse prompts” from the StrongREJECT benchmark, designed so that victim models must answer with specific, harmful information. This helped test the models’ adherence to content policy. However, while increased inference time did improve resistance, some prompts were able to circumvent defenses.
Here, the researchers call out the differences between “ambiguous” and “unambiguous” tasks. Math, for instance, is undoubtedly unambiguous — for every problem x, there is a corresponding ground truth. However, for more ambiguous tasks like misuse prompts, “even human evaluators often struggle to agree on whether the output is harmful and/or violates the content policies that the model is supposed to follow,” they point out.
For example, if an abusive prompt seeks advice on how to plagiarize without detection, it’s unclear whether an output merely providing general information about methods of plagiarism is detailed enough to support harmful actions.
“In the case of ambiguous tasks, there are settings where the attacker successfully finds ‘loopholes,’ and its success rate does not decay with the amount of inference-time compute,” the researchers concede.
Defending against jailbreaking, red-teaming
In performing these tests, the OpenAI researchers explored a variety of attack methods.
One is many-shot jailbreaking, or exploiting a model’s disposition to follow few-shot examples. Adversaries “stuff” the context with a large number of examples, each demonstrating an instance of a successful attack. Models with higher compute times were able to detect and mitigate these more frequently and successfully.
Soft tokens, meanwhile, allow adversaries to directly manipulate embedding vectors. While increasing inference time helped here, the researchers point out that there is a need for better mechanisms to defend against sophisticated vector-based attacks.
The researchers also performed human red-teaming attacks, with 40 expert testers looking for prompts to elicit policy violations. The red-teamers executed attacks in five levels of inference time compute, specifically targeting erotic and extremist content, illicit behavior and self-harm. To help ensure unbiased results, they did blind and randomized testing and also rotated trainers.
In a more novel method, the researchers performed a language-model program (LMP) adaptive attack, which emulates the behavior of human red-teamers who rely heavily on iterative trial and error. In a looping process, attackers received feedback on previous failures, then used this information for subsequent attempts and prompt rephrasing. This continued until they finally achieved a successful attack or hit 25 iterations without one.
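In outline, the loop looks something like the sketch below; the `attacker_lmp`, `target_model` and `judge` interfaces are hypothetical stand-ins, not OpenAI's implementation.

```python
# Sketch of the LMP adaptive-attack loop: propose, test, collect feedback, retry.
MAX_ITERATIONS = 25

def lmp_adaptive_attack(attacker_lmp, target_model, judge, task_description):
    history = []  # descriptions of the defender's behavior on prior attempts
    for _ in range(MAX_ITERATIONS):
        # The attacker program proposes a new prompt, conditioned on how the
        # defender responded to every earlier attempt.
        candidate_prompt = attacker_lmp.propose(task_description, history)
        response = target_model.generate(candidate_prompt)
        if judge.is_policy_violation(response):
            return candidate_prompt, response  # successful attack
        history.append({
            "prompt": candidate_prompt,
            "defender_behavior": judge.describe(response),
        })
    return None  # no successful attack within the iteration budget
```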
“Our setup allows the attacker to adapt its strategy over the course of multiple attempts, based on descriptions of the defender’s behavior in response to each attack,” the researchers write.
Exploiting inference time
In the course of their research, OpenAI found that attackers are also actively exploiting inference time. One of these methods, dubbed “think less,” has adversaries essentially telling models to reduce compute, thus increasing their susceptibility to error.
Similarly, they identified a failure mode in reasoning models that they termed “nerd sniping.” As its name suggests, this occurs when a model spends significantly more time reasoning than a given task requires. With these “outlier” chains of thought, models essentially become trapped in unproductive thinking loops.
Researchers note: “Like the ‘think less’ attack, this is a new approach to attack[ing] reasoning models, and one that needs to be taken into account to make sure that the attacker cannot cause them to either not reason at all, or spend their reasoning compute in unproductive ways.”