Behind every technological revolution is a chart with an exponential curve.
In the 20th century, microchip pioneers like Gordon Moore, the co-founder of Intel, saw the density of components on a computer chip doubling roughly every year, and predicted the trend would continue for the foreseeable future. (This observation, which became known as “Moore’s Law” and which Moore later revised to a doubling roughly every two years, fueled the boom in personal computing and held up for more than 50 years.)
During the internet boom of the early 2000s, Mary Meeker, an influential stock analyst, moved markets with her PowerPoint presentations showing the explosive growth of e-commerce, online advertising and mobile phones, all of which contributed to a sense that underneath all of the dot-com hype, something big and important was happening.
Today’s artificial intelligence boom is awash in data showing the rapid progress of A.I. systems, and hype-filled claims about what the technology can and can’t do.
But none of it has captured the public’s attention quite like a chart made by METR, an obscure 30-person nonprofit based in Berkeley, Calif.
This chart — often referred to as the “METR time-horizon” chart — has become a discourse-dominating obsession among A.I. researchers, Wall Street investors and industry watchers. They have studied it with Talmudic intensity, looking for signs that the A.I. boom is tapering off, or that it is accelerating, or merely that it confirms what they already believed was happening.
A.I. companies like OpenAI and Anthropic have fought to outdo one another’s time-horizon scores, and hundreds of billions of dollars have been spent on data centers and chips to train more powerful A.I. models, in hopes of continuing the chart’s upward trajectory. It may be only a slight exaggeration to say, as some have, that the METR time-horizon chart is holding up the global stock market.
“METR’s time-horizon evaluations have been hugely influential, having escaped containment from the Silicon Valley A.I. community to reach broader audiences,” said Rishi Bommasani, a researcher at Stanford’s Institute for Human-Centered Artificial Intelligence.
But what is METR’s chart measuring, exactly? How nervous should it make us about what’s happening in A.I.? And what would it mean if — like Moore’s Law — its curve kept climbing?
To find out, I recently spent an afternoon at METR’s office meeting its research leaders. They regaled me with dense, technical explanations about their measurements, and how they track the progress of A.I. systems.
It was a bit like entering a den of N.B.A. statisticians who track things like “developer uplift” and “covert capabilities” instead of assists and rebounds. And it left me with an uneasy sense that if their measurements are even close to correct, things are about to get very weird.
Next stop: intelligence explosion?
METR, which stands for Model Evaluation and Threat Research, was founded in 2023, when its staff spun out from another A.I. safety nonprofit. Its goal was to provide credible, third-party evaluations of leading A.I. models, so that the public and policymakers could understand the pace of progress.
The organization’s office is inside a co-working space in Berkeley that is shared with various A.I. safety groups. (The AI Futures Project, which produced the viral “AI 2027” report last year, is one floor above.) METR’s office is full of huge, multi-monitor computer rigs, whiteboards with graphs and math equations, and researchers who have devoted their careers to monitoring the situation. The organization’s funding comes mainly from private philanthropies, including the Audacious Project, and it gets free computing credits (though not money) from the major A.I. companies, in exchange for helping to test their models.
For years, A.I. progress was measured in test scores. Companies would run their models through batteries of standardized exams, assessing how they stacked up against rival models at solving math problems, answering legal questions or summarizing text accurately.
These were useful measurements. But they didn’t work well when it came to A.I. agents — systems designed to work autonomously for minutes or hours at a time. What you really wanted to know, if you were interested in these systems, was how long they could work before getting stuck. Could they handle a simple task that would take a human a few minutes, or a more complex task that would take someone a few hours?
METR’s researchers attempted to track this by creating a benchmark of software engineering tasks — like debugging code, setting up servers and training small A.I. models. They hired expert software developers to do the tasks. Then they had A.I. agents attempt the same tasks. When an agent succeeded at a task, they logged the time it had taken the human expert to do the same work. They plotted the results on a single chart — task length on one axis, time on the other — and produced a trend line across years of A.I. progress.
What they found was surprising. The length, in human-hours, of a task an A.I. agent was able to complete reliably was doubling roughly every seven months. More recently, with models like Anthropic’s Claude Opus 4.5 and OpenAI’s GPT-5.2, the line took a sharp upward turn — the task length is now doubling every three to four months.
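The doubling time METR reports can be recovered from data like this with a simple log-linear fit: if task length grows exponentially, its base-2 logarithm grows linearly, and the reciprocal of that line's slope is the doubling period in months. The sketch below uses invented data points (not METR's actual measurements) purely to show the arithmetic.

```python
import math

# Hypothetical (months elapsed, task length in human-hours) observations,
# loosely shaped like METR's reported seven-month doubling -- NOT real data.
observations = [
    (0, 0.1),   # month 0: agent reliably handles ~6-minute tasks
    (7, 0.2),   # seven months later: task length has doubled
    (14, 0.4),
    (21, 0.8),
]

# Ordinary least-squares fit of log2(task length) = a + b * month.
# The doubling time is then 1 / b months.
xs = [month for month, _ in observations]
ys = [math.log2(hours) for _, hours in observations]
n = len(observations)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
doubling_months = 1 / slope
print(round(doubling_months, 1))  # 7.0 for this synthetic series
```

The same fit, run over newer points spaced more tightly in time, would yield the shorter three-to-four-month doubling the article describes.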
“We definitely weren’t expecting it to be such a clear trend and such a straight line,” said Beth Barnes, METR’s co-founder and chief executive.
(The New York Times sued OpenAI and Microsoft in 2023 for copyright infringement of news content related to A.I. systems. The two companies have denied those claims.)
Ms. Barnes, who worked in safety research at OpenAI, admitted that she wasn’t sure how long the trend line would continue. But the fear is that if A.I. systems can do very long programming tasks reliably, they could become capable of what is known as “recursive self-improvement” — a model training a better model, that model training a better model, and so on, until it has built something that far surpasses human intelligence.
This hypothetical scenario is known among A.I. researchers as an “intelligence explosion.” And while many skeptics have given laundry lists of reasons it won’t happen, the researchers at METR aren’t ready to rule it out. When I asked them to estimate the probability that an intelligence explosion would start this year, their answers ranged from less than 1 percent to around 10 percent.
Chris Painter, METR’s president, said the most likely path to an intelligence explosion would lead through the full automation of A.I. research and development. Not long ago, such a possibility seemed too remote to contemplate. But the upward march of the time-horizon chart has made it feel less far-fetched.
“This is the first year where it feels like it might be automated this year,” Mr. Painter said.
A Rorschach test for A.I.
Techno-optimists have seized on METR’s time-horizon chart to claim that artificial general intelligence — machines capable of doing most of what a skilled human can do — is close at hand. A.I. safety worriers have used it as evidence that the apocalypse is nigh. Corporate C.E.O.s have made radical shifts in strategy on the theory that A.I. might eventually replace entire departments of human workers.
METR’s chart doesn’t actually say any of this. It doesn’t measure how many jobs A.I. systems can displace, or the likelihood of an A.I. takeover. (An article in MIT Technology Review called it “the most misunderstood graph in A.I.”) The tasks it measures are limited to programming and A.I. research, and some critics have questioned whether the chart measures even those correctly. In a widely circulated critique in January, the A.I. researcher Nathan Witkin argued that METR’s evaluation “suffers from such severe methodological problems that it is a hair’s breadth from being totally useless.”
METR also produced a study last year that seemed to point in the opposite direction of its time-horizon findings. The organization ran a randomized controlled trial of A.I. coding tools that found that while developers estimated they were about 20 percent faster with A.I. tools, they were actually 19 percent slower. A.I. skeptics seized on the study as proof that the A.I. coding revolution was mostly hype.
In a follow-up published two months ago, METR estimated that the real effect was now probably a speedup of about 20 percent.
METR’s researchers stressed that they are not rooting for any particular outcome. They are merely trying to figure out what is happening, in time for policymakers and the public to make good decisions.
“METR is an organization that asks questions selected for what we think would be most valuable for the world to know about A.I. and its risks,” said Ajeya Cotra, a longtime A.I. safety researcher who recently joined the organization. “And then the answers are what they are.”
Secret agents
Still, the implications of METR’s charts are startling. They paint a picture of A.I. progress that is speeding up, and systems that are becoming increasingly autonomous, with no obvious limit in sight. Given that we still don’t fully know how these systems work, the worst case is that we might end up in a world full of superior intelligences, doing things we can’t fully understand or control.
Recently, METR started a new line of research into what it calls “covert capabilities.” In these experiments, an A.I. model is given an ordinary assignment — write some code, set up a server — but is also instructed to do something sneaky on the side, like quietly slip a subtle bug into a block of code it’s working on. A separate A.I. monitor watches the model, looking for signs of mischief.
The goal of this research, Ms. Barnes said, is to figure out what A.I. systems could become capable of doing surreptitiously.
An even spookier possibility is that some of today’s A.I. models are powerful enough to recognize when they are being tested, and may be altering their behavior accordingly. This kind of situational awareness has been observed in the most powerful models from companies like OpenAI and Anthropic, and it makes measuring their true capabilities harder. Some models have also been shown to be capable of “sandbagging,” or purposefully underperforming on tests.
Joel Becker, a METR researcher who works on the time-horizon chart, identified himself as “a more bearish person” at the organization. He doesn’t think an intelligence explosion is imminent, and he observes that even the leading models still suffer from jaggedness — excelling at some tasks while stumbling on seemingly easier ones — that makes it hard to draw sweeping conclusions about their capabilities.
Still, he compared the feeling he has, these days, to the feeling he had during the early days of the Covid-19 pandemic, when only the people who understood the power of exponential growth knew what was about to happen.
“I think we might be in the beginning period of a totally extraordinary moment,” he said.
Kevin Roose is a Times technology columnist and a host of the podcast “Hard Fork.”
The post How Do You Measure an A.I. Boom? appeared first on New York Times.