Why Researchers Are Making AI Models Play Dungeons & Dragons

Researchers at UC San Diego, presenting their work at the NeurIPS conference, played some Dungeons & Dragons with large language models. The goal was to test how the models handle a variety of complex D&D-centric tasks, like creative teamwork, keeping the complex rules in line, and tracking all the consequences that pile up over time instead of starting fresh with each new prompt.

It was a bit of a stress test to see how well the models could handle long-form, complex tasks. Most AI evaluations focus on short tasks, such as answering a single question, before moving on. As UC San Diego computer scientist Raj Ammanabrolu put it, D&D is a “natural testing ground” for multistep planning, rule-following, and collaboration, all wrapped in an adventure where you get to kill some fantasy baddies with magic and then roll dice to see if you’ve successfully romanced a tavern wench.

The team paired several language models with a game engine that enforced D&D’s rules, maps, and resource locations, in an attempt to limit the pesky hallucinations that, more often than not, make AI an unreliable source of anything. The AI played both heroes and monsters in a combat-focused campaign, sometimes against other AI, sometimes against itself, and sometimes against about 2,000 experienced human players. Researchers evaluated how well the models tracked game state, chose actions, and stayed in character.

The results…were mixed, but occasionally theatrical, if a bit overwrought. Warlocks, always melodramatic in fiction, tended, appropriately, to overreact to everything, making long, dramatic speeches at every turn. Paladins did something similar, delivering heroic speeches at the worst possible times, when they were absolutely unnecessary. The AI also seems to have picked up one of the more annoying features of video games, making goblins repeat the same dumb NPC-style trash talk over and over again.

As expected, different models had distinct styles of play and action description. China’s DeepSeek-V3 preferred short, punchy action beats, but every character basically felt like the same person. Claude 3.5 Haiku did a better job of customizing dialogue responses to class and character. GPT-4o was a middle ground between the two, blending strong narration with vivid descriptions of its tactical movements.

Overall, the researchers say the models handled rule-based interaction surprisingly well, but all of them showed strain over longer durations, which is exactly what the project set out to examine. Smaller open-source models were especially inconsistent, and every system got worse at keeping the game state straight as games dragged on.

The team’s next step is a full D&D campaign, with a complete story, deep exploration, and all the bells and whistles of a long-form adventure. That’s where long-term memory really gets tested. Hopefully, the AI goblins come up with some better dialogue.

The post Why Researchers Are Making AI Models Play Dungeons & Dragons appeared first on VICE.