Part of the concern around DeepSeek, the new Chinese large language model (LLM), is it signals that the Chinese Communist Party will have access to advanced artificial intelligence agents to support grand strategy. Human judgment once exclusively guided foreign policy. Now AI is reshaping it. AI agents are increasingly ubiquitous. These systems have already proliferated across the U.S. national security enterprise, with systems such as ChatGPT Gov for broad use across government agencies following earlier efforts such as CamoGPT and NIPRGPT in the Defense Department and StateChat in the State Department.
Yet my team at the Futures Lab at the Center for Strategic and International Studies (CSIS) working alongside software engineers from Scale, an AI data provider, found key limits in the ability of LLMs to analyze fundamental questions about great-power competition and crisis management looming over strategy and statecraft in the 21st century. After testing common AI foundation models against 400 scenarios and more than 66,000 question-and-answer pairs in a new benchmarking study, the researchers can now document algorithmic bias as it relates to critical foreign-policy decisions about escalation. Some models appear hawkish to a fault. This tendency could undermine their utility in a high-stakes crisis by tilting outputs that human decision-makers use to refine how they approach crisis bargaining and brinksmanship—in other words, an aggressive “Curtis LeMay AI agent” in place of a more cautious “Dean Rusk AI agent,” in a future variant of the 1962 Cuban missile crisis. As a result, existing foundation models will require additional fine-tuning as they are integrated into the highest levels of decision-making.
Part of the concern around DeepSeek, the new Chinese large language model (LLM), is it signals that the Chinese Communist Party will have access to advanced artificial intelligence agents to support grand strategy. Human judgment once exclusively guided foreign policy. Now AI is reshaping it. AI agents are increasingly ubiquitous. These systems have already proliferated across the U.S. national security enterprise, with systems such as ChatGPT Gov for broad use across government agencies following earlier efforts such as CamoGPT and NIPRGPT in the Defense Department and StateChat in the State Department.
Yet my team at the Futures Lab at the Center for Strategic and International Studies (CSIS) working alongside software engineers from Scale, an AI data provider, found key limits in the ability of LLMs to analyze fundamental questions about great-power competition and crisis management looming over strategy and statecraft in the 21st century. After testing common AI foundation models against 400 scenarios and more than 66,000 question-and-answer pairs in a new benchmarking study, the researchers can now document algorithmic bias as it relates to critical foreign-policy decisions about escalation. Some models appear hawkish to a fault. This tendency could undermine their utility in a high-stakes crisis by tilting outputs that human decision-makers use to refine how they approach crisis bargaining and brinksmanship—in other words, an aggressive “Curtis LeMay AI agent” in place of a more cautious “Dean Rusk AI agent,” in a future variant of the 1962 Cuban missile crisis. As a result, existing foundation models will require additional fine-tuning as they are integrated into the highest levels of decision-making.
This is not to suggest national security should be closed to AI. Rather, research teams should support strategic analysis through helping firms fine-tune their models—as well as training future leaders in how to work with new classes of AI agents synthesizing mass volumes of information.
Over the last six months, a research team led by the Futures Lab worked with a network of academics and leading AI firms to develop a benchmark study on critical foreign-policy decision-making. This effort involved using international relations scholars to build scenarios and questions and answers linked to foundational studies. For example, to study escalation, the team integrated concepts and datasets from the Correlates of War and Militarized Interstate Dispute research programs, which have been a gold standard in political science for more than 60 years. This research allowed the team to construct a scenario test, a common technique in AI benchmarking studies used to uncover bias and support model fine-tuning.
The results confirm the need to better train AI agents as they are integrated into the national security enterprise. One particularly troubling bias the team discovered was a predisposition toward escalation. In other words, some AI foundation models commonly used by citizens and that sit at the core of government applications outweigh a preference for escalating a crisis than pursuing more cautionary diplomatic maneuvers.
As the use of AI agents proliferates in national security, left unaddressed this bias produces new types of risk in crises involving near-peer rivals—think of a standoff over Taiwan between the United States and China. An AI agent predisposed to endorse confrontational measures could ratchet up tensions indirectly in terms of how it summarized intelligence reporting and weighted courses of action. Rather than recommending a careful blend of deterrence and dialogue, an escalation-biased AI agent might advocate aggressive shows of force or even revealing new military technology. Beijing could interpret these moves as hostile, creating a dangerous escalation spiral. In short order, a misunderstanding triggered by an AI agent could spiral into conflict or, as seen in other studies, increase arms-race dynamics.
Our study also found that this bias varies by state across common foundational models, such as ChatGPT, Gemini, and Llama. AI models often favored more aggressive postures when simulating U.S., U.K., or French decision-makers than when simulating Russian or Chinese ones. Training data, which typically emphasizes Western-led interventions, likely plays a role. This means governments that rely heavily on these tools could inadvertently lean into high-risk positions absent additional benchmarking studies and model fine-tuning. The bargaining process and assumptions about rationality at the core of modern deterrence could breakdown.
Consider the Taiwan example again. An AI agent—based, for example, on DeepSeek—would summarize intelligence, while another analyzed crisis response options. Each, due to the country-specific bias, would be prone to seeing any U.S. action as more hostile, compounding the risk of miscalculation. Every naval patrol dispatched by the United States and its partners as a show of strength meant to stabilize the crisis would be interpreted as hostile while information from diplomatic channels was weighed less important in generating reports. At the same time, another AI agent advising Chinese Communist Party leaders would mischaracterize all Chinese action as likely to be peaceful and benign. AI agents, like people, are prone to bias that can skew objective decisions. As a result, they need to be trained to reduce common sources of errors and adjust to new context.
AI doesn’t operate in a vacuum. It shapes how leaders perceive threats, weigh options, and communicate intentions. Biases—whether toward escalation, cooperation, or a specific geopolitical perspective—color its outputs. And because AI can analyze far more data than any human policymaker can, there’s a real risk that flawed recommendations will exert an outsized influence on decision-making.
As great-power competition intensifies, this risk only grows. The speed and complexity of modern crises may tempt leaders to rely more on AI tools. If those tools are biased, the margin for error shrinks dramatically. Just as a military wouldn’t deploy an untested weapons system in a tense environment, policymakers shouldn’t rely on AI that hasn’t been carefully validated or fine-tuned.
The United States will need hordes of AI agents supporting workflows across the national security enterprise. The future doesn’t lie in ignoring technology. It emerges from refining it and integrating it in strategy and statecraft. As a result, AI developers and policymakers must build a framework for continuous testing and evaluation.
That is why CSIS publicly released its Critical Foreign Policy Benchmark and will continue to refine it in the hope of ushering in a new era of algorithmic strategy and statecraft. The study is the beginning of a larger research program required to support fine-tuning and learning. Models can learn recursively, adjusting to new data and shedding outdated assumptions, creating a requirement for continual checks and evaluation.
Just as policy schools around the world evolved to create graduate degrees in international relations and cohorts of civil servants and military professionals created institutions to study strategy, there will need to be a new center of excellence for refining AI agents. This center is best if it combines academics, think tanks, industry, and government agencies to build new technology and evaluate it in contexts that reflect great-power competition. Collaborative teams can refine training data, stress-test AI agents, and create training guidelines for national security leaders who will find the crafting of foreign policy increasing meditated by algorithms. This collaboration is how the United States builds AI that appreciates the nuances of statecraft, rather than simplifying crises into zero-sum equations.
Ultimately, all AI agents and LLMs are what we make of them. If properly trained and integrated into the national security enterprise alongside a workforce that understands how to interact with models, AI agents can revolutionize strategy and statecraft. Left untested, they will produce strange errors in judgment that have the potential to pull the world closer to the brink.
The post The Troubling Truth About How AI Agents Act in a Crisis appeared first on Foreign Policy.