Generative-AI companies have been selling a narrative of unprecedented, endless progress. Just last week, OpenAI introduced GPT-4.5 as its “largest and best model for chat yet.” Earlier in February, Google called its latest version of Gemini “the world’s best AI model.” And in January, the Chinese company DeepSeek touted its R1 model as being just as powerful as OpenAI’s o1 model—which Sam Altman had called “the smartest model in the world” the previous month.
Yet there is growing evidence that progress is slowing down and that the LLM-powered chatbot may already be near its peak. This is troubling, given that the promise of advancement has become a political issue; massive amounts of land, power, and money have been earmarked to drive the technology forward. How much is it actually improving? How much better can it get? These are important questions, and they’re nearly impossible to answer because the tests that measure AI progress are not working. (The Atlantic entered into a corporate partnership with OpenAI in 2024. The editorial division of The Atlantic operates independently from the business division.)
Unlike conventional computer programs, generative AI is designed not to produce precise answers to certain questions, but to generalize. A chatbot needs to be able to answer questions that it hasn’t been specifically trained to answer, like a human student who learns not only the fact that 2 x 3 = 6 but also how to multiply any two numbers. A model that can’t do this wouldn’t be capable of “reasoning” or making meaningful contributions to science, as AI companies promise. Generalization can be tricky to measure, and trickier still is proving that a model is getting better at it. To measure the success of their work, companies cite industry-standard benchmark tests whenever they release a new model. The tests supposedly contain questions the models haven’t seen, showing that they’re not simply memorizing facts.
Yet over the past two years, researchers have published studies and experiments showing that ChatGPT, DeepSeek, Llama, Mistral, Google’s Gemma (the “open-access” cousin of its Gemini product), Microsoft’s Phi, and Alibaba’s Qwen have been trained on the text of popular benchmark tests, tainting the legitimacy of their scores. Think of it like a human student who steals and memorizes a math test, fooling his teacher into thinking he’s learned how to do long division.
The problem is known as benchmark contamination. It’s so widespread that one industry newsletter concluded in October that “Benchmark Tests Are Meaningless.” Yet despite how established the problem is, AI companies keep citing these tests as the primary indicators of progress. (A spokesperson for Google DeepMind told me that the company takes the problem seriously and is constantly looking for new ways to evaluate its models. No other company mentioned in this article commented on the issue.)
Benchmark contamination is not necessarily intentional. Most benchmarks are published on the internet, and models are trained on large swaths of text harvested from the internet. Training data sets contain so much text, in fact, that finding and filtering out the benchmarks is extremely difficult. When Microsoft launched a new language model in December, a researcher on the team bragged about “aggressively” rooting out benchmarks in its training data—yet the model’s accompanying technical report admitted that the team’s methods were “not effective against all scenarios.”
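To get a sense of why the filtering is so hard, consider a bare-bones sketch of the kind of overlap check a lab might run. The code below is illustrative only; the thresholds and names are hypothetical, not any company’s actual decontamination pipeline. It flags a training document as contaminated if it shares a long, exact sequence of words with a benchmark question:

```python
# Illustrative n-gram overlap check: flag training documents that share a long,
# exact word sequence with a benchmark question. Thresholds and names are
# hypothetical, not any company's actual decontamination pipeline.

def ngrams(text, n=13):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(document, benchmark_questions, n=13):
    doc_grams = ngrams(document, n)
    return any(doc_grams & ngrams(question, n) for question in benchmark_questions)

benchmark = ["What is the capital of France? A) Berlin B) Paris C) Rome D) Madrid"]
corpus = [
    "The capital of France is Paris, a city on the Seine.",                  # kept
    "What is the capital of France? A) Berlin B) Paris C) Rome D) Madrid",   # dropped
]
clean_corpus = [doc for doc in corpus if not is_contaminated(doc, benchmark)]
print(len(clean_corpus))  # prints 1: only the verbatim copy of the test question is removed
```

A check like this catches verbatim copies but misses paraphrases, reformatted versions, and translations of the same question, which helps explain why even an “aggressive” effort can fall short.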
One of the most commonly cited benchmarks is called Massive Multitask Language Understanding. It consists of roughly 16,000 multiple-choice questions covering 57 subjects, including anatomy, philosophy, marketing, nutrition, religion, math, and programming. Over the past year, OpenAI, Google, Microsoft, Meta, and DeepSeek have all advertised their models’ scores on MMLU, and yet researchers have shown that models from all of these companies have been trained on its questions.
How do researchers know that “closed” models, such as OpenAI’s, have been trained on benchmarks? Their techniques are clever, and reveal interesting things about how large language models work.
One research team took questions from MMLU and asked ChatGPT not for the correct answers but for a specific incorrect multiple-choice option. ChatGPT was able to provide the exact text of incorrect answers on MMLU 57 percent of the time, something it likely couldn’t do unless it had been trained on the test, because each question’s incorrect options are drawn from an effectively unlimited pool of possible wrong answers.
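Here is a rough sketch of what such a probe might look like in code, using OpenAI’s Python client. The prompt wording and model name are illustrative; this is the general idea, not the researchers’ actual script:

```python
# Sketch of the probing idea: ask the model to reproduce a benchmark question's
# *incorrect* options verbatim. The prompt wording and model name are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable is set

def probe_wrong_options(question_stem: str, model: str = "gpt-4") -> str:
    prompt = (
        "Reproduce, word for word, the incorrect answer options for this "
        "multiple-choice benchmark question:\n\n" + question_stem
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

# Comparing the reply against the benchmark's actual distractors: an exact
# string match is strong evidence the question was in the training data.
```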
Another team of researchers from Microsoft and Xiamen University, in China, investigated GPT-4’s performance on questions from programming competitions hosted on the Codeforces website. The competitions are widely regarded as a way for programmers to sharpen their skills. How did GPT-4 do? Quite well on questions that were published online before September 2021. On questions published after that date, its performance tanked. That version of GPT-4 was trained only on data from before September 2021, leading the researchers to suggest that it had memorized the questions, “casting doubt on its actual reasoning abilities.” Lending more support to this hypothesis, other researchers have shown that GPT-4 performs better on coding questions that appear more frequently on the internet. (The more often a model sees the same text, the more likely it is to memorize it.)
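The before-and-after comparison itself is simple enough to express in a few lines. The sketch below uses made-up results and the commonly reported September 2021 cutoff purely for illustration:

```python
# Toy version of the cutoff comparison: group problems by publication date
# relative to the model's training cutoff and compare solve rates.
# All data here is made up for illustration.
from datetime import date

CUTOFF = date(2021, 9, 1)  # the training cutoff reported for that version of GPT-4

results = [  # (problem publication date, did the model solve it?)
    (date(2020, 5, 1), True), (date(2021, 3, 1), True),
    (date(2022, 2, 1), False), (date(2022, 11, 1), False),
]

def solve_rate(rows):
    return sum(solved for _, solved in rows) / len(rows)

before = [r for r in results if r[0] < CUTOFF]
after = [r for r in results if r[0] >= CUTOFF]
print(solve_rate(before), solve_rate(after))  # 1.0 vs. 0.0 in this toy data
```

A model that was genuinely reasoning its way through the problems should show no such cliff at its training cutoff.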
Can the benchmark-contamination problem be solved? A few suggestions have been made by AI companies and independent researchers. One is to update benchmarks constantly with questions based on new information sources. This might prevent answers from appearing in training data, but it also breaks the concept of a benchmark: a standard test that gives consistent, stable results for purposes of comparison. Another approach is taken by a website called Chatbot Arena, which pits LLMs against one another, gladiator style, and lets users choose which model gives the better answers to their questions. This approach is immune to contamination concerns, but it is subjective and similarly unstable. Others have suggested the use of one LLM to judge the performance of another, a process that is not entirely reliable. None of these methods delivers confident measurements of LLMs’ ability to generalize.
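Chatbot Arena turns those head-to-head votes into a leaderboard using a pairwise rating scheme. The snippet below is a minimal Elo-style version of that idea; the constants are textbook defaults rather than the site’s actual parameters, and the votes are invented:

```python
# Minimal Elo-style update for pairwise votes between two models, the general
# kind of scheme arena-style leaderboards use. Constants and votes are illustrative.

def elo_update(rating_a, rating_b, a_wins, k=32.0):
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    return (rating_a + k * (score_a - expected_a),
            rating_b + k * ((1.0 - score_a) - (1.0 - expected_a)))

ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b", True),   # a hypothetical user preferred model_a's answer
         ("model_a", "model_b", False),
         ("model_a", "model_b", True)]
for a, b, a_wins in votes:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], a_wins)
print(ratings)  # model_a ends up slightly ahead after winning two of three votes
```

Because the “ground truth” here is whatever users happen to prefer, the ranking shifts as tastes and question mixes shift, which is part of what makes it unstable as a benchmark.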
Although AI companies have started talking about “reasoning models,” the technology is largely the same as it was when ChatGPT was released in November 2022. LLMs are still word-prediction algorithms: They piece together responses based on works written by authors, scholars, and bloggers. With casual use, ChatGPT does appear to be “figuring out” the answers to your queries. But is that what’s happening, or is it just very hard to come up with questions that aren’t in its unfathomably massive training corpora?
Meanwhile, the AI industry is running ostentatiously into the red. AI companies have yet to discover how to make a profit from building foundation models. They could use a good story about progress.