AI influencer Matt Shumer penned a viral blog post on X about AI’s potential to disrupt, and ultimately automate, almost all knowledge work; it has racked up more than 55 million views in the past 24 hours. Shumer’s 5,000-word essay certainly hit a nerve. Written in a breathless tone, it is constructed as a warning to friends and family about how their jobs are about to be radically upended. (Fortune also ran an adapted version of Shumer’s post as a commentary piece.)

“On February 5th, two major AI labs released new models on the same day: GPT-5.3 Codex from OpenAI, and Opus 4.6 from Anthropic,” he writes. “And something clicked. Not like a light switch…more like the moment you realize the water has been rising around you and is now at your chest.”

Shumer says coders are the canary in the coal mine for every other profession. “The experience that tech workers have had over the past year, of watching AI go from ‘helpful tool’ to ‘does my job better than I do,’ is the experience everyone else is about to have,” he writes. “Law, finance, medicine, accounting, consulting, writing, design, analysis, customer service. Not in ten years. The people building these systems say one to five years. Some say less. And given what I’ve seen in just the last couple of months, I think ‘less’ is more likely.”

But despite its viral reach, Shumer’s assertion that what’s happened with coding is a prequel for what will happen in other fields—and, critically, that this will happen within just a few years—seems wrong to me. And I write this as someone who wrote a book (Mastering AI: A Survival Guide to Our Superpowered Future) that predicted AI would massively transform knowledge work by 2029, something I still believe. I just don’t think the full automation of processes that we are starting to see with coding is coming to other fields as quickly as Shumer contends. He may be directionally right, but the dire tone of his missive strikes me as fear-mongering, based largely on faulty assumptions.
Not all knowledge work is like software development
Shumer says the reason code is the area where autonomous agentic capabilities have had the biggest impact so far is that the AI companies have devoted so much attention to it. They have done so, Shumer says, because these frontier model companies see autonomous software development as key to their own businesses, enabling AI models to help build the next generation of AI models. In this, the AI companies’ bet seems to be paying off: the pace at which they are churning out better models has picked up markedly in the past year. And both OpenAI and Anthropic have said that the code behind their most recent AI models was largely written by AI itself.

Shumer says that coding is a leading indicator: the same performance gains seen in coding arrive in other domains, although sometimes about a year later. (Shumer does not offer a cogent explanation for why this lag might exist, although he implies it is simply because the AI model companies optimize for coding first and then eventually get around to improving the models in other areas.)

But what Shumer doesn’t say is that there is another reason progress in automating software development has been more rapid than in other areas: coding has quantitative metrics of quality that simply don’t exist in other domains. In programming, if the code is really bad it simply won’t compile at all. Inadequate code may also fail various unit tests that the AI coding agent can run. (Shumer doesn’t mention that today’s coding agents sometimes lie about conducting unit tests—which is one of many reasons automated software development isn’t foolproof.) Many developers say the code that AI writes is often decent enough to pass these basic tests but is still not very good: it is inefficient, inelegant and, most important, insecure, exposing an organization that uses it to cybersecurity risks.

But in coding, there are ways to build autonomous AI agents that address some of these issues. The model can spin up sub-agents that check the code it has written for cybersecurity vulnerabilities or critique how efficient it is. And because software code can be tested in virtual environments, there are plenty of ways to automate the process of reinforcement learning (where an agent learns by experience to maximize some reward, such as points in a game) that AI companies use to shape the behavior of AI models after their initial training. That means the refinement of coding agents can be done in an automated way at scale.

Assessing quality in many other domains of knowledge work is far more difficult. There are no compilers for law, no unit tests for a medical treatment plan, no definitive metric for how good a marketing campaign is before it is tested on consumers. It is much harder in other domains to gather sufficient amounts of data from professional experts about what “good” looks like. AI companies realize they have a problem gathering this kind of data. It is why they are now paying millions to companies like Mercor, which in turn are shelling out big bucks to recruit accountants, finance professionals, lawyers, and doctors to provide feedback on AI outputs so AI companies can train their models better.
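To make that contrast concrete, here is a minimal sketch, in Python, of the kind of automated check that exists for code but not for, say, a legal brief. The file name, the tests directory, and the harness itself are illustrative assumptions for this article, not any AI lab’s actual agent tooling.

```python
# A minimal, illustrative verification loop for AI-generated code.
# NOTE: "candidate.py", the tests directory, and this harness are
# hypothetical placeholders, not a real vendor pipeline.
import os
import pathlib
import subprocess
import sys
import tempfile


def passes_automated_checks(candidate_code: str, tests_dir: str) -> bool:
    """Return True only if the candidate code compiles and its unit tests pass."""
    workdir = pathlib.Path(tempfile.mkdtemp())
    module_path = workdir / "candidate.py"
    module_path.write_text(candidate_code)

    # Check 1: the "won't even compile" bar. Syntactically broken code fails here.
    compiled = subprocess.run(
        [sys.executable, "-m", "py_compile", str(module_path)],
        capture_output=True,
    )
    if compiled.returncode != 0:
        return False

    # Check 2: run the existing unit tests with the candidate module importable.
    env = {**os.environ, "PYTHONPATH": str(workdir)}
    tested = subprocess.run(
        [sys.executable, "-m", "pytest", tests_dir, "-q"],
        capture_output=True,
        env=env,
    )
    return tested.returncode == 0
```

The point is not the specific code but that the pass/fail result is cheap, automatic, and repeatable—exactly the kind of reward signal that makes large-scale reinforcement learning on code possible, and exactly what is missing for a legal brief or a treatment plan.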
It is true that there are benchmarks showing the most recent AI models making rapid progress on professional tasks outside of coding. One of the best of these is OpenAI’s GDPVal benchmark, which compares frontier models to human experts across a range of professional tasks, from complex legal work to manufacturing to healthcare. The results aren’t in yet for the models OpenAI and Anthropic released last week. But their predecessors, Claude Opus 4.5 and GPT-5.2, achieved parity with human experts across a diverse range of tasks and beat them in many domains.
So wouldn’t this suggest that Shumer is correct? Well, not so fast. It turns out that in many professions, what “good” looks like is highly subjective. Human experts agreed with one another on their assessments of the AI outputs only about 71% of the time. The automated grading system OpenAI uses for GDPVal has even more variance, agreeing on assessments only 66% of the time. So those headline numbers about how good AI is at professional tasks could have a wide margin of error.
Enterprises need reliability, governance, and auditability
This variance is one of the things that holds enterprises back from deploying fully automated workflows. It’s not just that the output of the AI model itself might be faulty. It’s that, as the GDPVal benchmark suggests, the equivalent of an automated unit test in many professional contexts might produce an erroneous result a third of the time. Most companies cannot tolerate poor-quality work being shipped in a third of cases. The risks are simply too great. In some cases, the risk might be merely reputational. In others, it could mean immediate lost revenue. But in many professional tasks, the consequences of a wrong decision can be even more severe: professional sanction, lawsuits, the loss of licenses, the loss of insurance cover, and even the risk of physical harm and death—sometimes to large numbers of people.

What’s more, trying to keep a human in the loop to review automated outputs is problematic. Today’s AI models are genuinely getting better, and hallucinations occur less frequently. But that only makes the problem worse. As AI-generated errors become less frequent, human reviewers grow complacent and the errors become harder to spot. AI is wonderful at being confidently wrong and at presenting results that are impeccable in form but lack substance, which bypasses some of the proxy criteria humans use to calibrate their level of vigilance. AI models also often fail in ways that are alien to the ways humans fail at the same tasks, which makes guarding against AI-generated errors even more of a challenge.
For all these reasons, until the equivalent of software development’s automated unit tests exists for more professional fields, deploying automated AI workflows in many knowledge work contexts will be too risky for most enterprises. AI will remain an assistant or copilot to human knowledge workers in many cases, rather than fully automating their work.

There are other reasons the kind of automation software developers have observed is unlikely to come as quickly to other categories of knowledge work. In many cases, enterprises cannot give AI agents access to the kinds of tools and data systems they need to perform automated workflows. It is notable that the most enthusiastic boosters of AI automation so far have been developers who work either by themselves or for AI-native startups. These coders are often unencumbered by legacy systems and tech debt, and typically don’t have a lot of governance and compliance systems to navigate. Big organizations, by contrast, often lack ways to link data sources and software tools together. In other cases, concerns about security risks and governance mean large enterprises, especially in regulated sectors such as banking, finance, law, and healthcare, are unwilling to automate without ironclad guarantees that the outcomes will be reliable and that there is a process for monitoring, governing, and auditing them. The systems for doing this are currently primitive. Until they become much more mature and robust, don’t expect enterprises to fully automate the production of business-critical or regulated outputs.
Critics say Shumer is not honest about LLM failings
I’m not the only one who found Shumer’s analysis faulty. Gary Marcus, the emeritus professor of cognitive science at New York University who has become one of the leading skeptics of today’s large language models, told me Shumer’s X post was “weaponized hype.” And he pointed to problems even with Shumer’s arguments about automated software development. “He gives no actual data to support this claim that the latest coding systems can write whole complex apps without making errors,” Marcus said. He pointed out that Shumer mischaracterizes a well-known benchmark from the AI evaluation organization METR, which tries to measure AI models’ autonomous coding capabilities and suggests those abilities are doubling every seven months. Marcus noted that Shumer fails to mention that the benchmark has two thresholds for accuracy, 50% and 80%. But most businesses aren’t interested in a system that fails half of the time, or even one that fails one out of every five attempts.
“No AI system can reliably do every five-hour-long task humans can do without error, or even close, but you wouldn’t know that reading Shumer’s blog, which largely ignores all the hallucinations and boneheaded errors that are so common in everyday experience,” Marcus said. He also noted that Shumer didn’t cite recent research from Caltech and Stanford that chronicled a wide range of reasoning errors in advanced AI models. And he pointed out that Shumer has previously been caught making exaggerated claims about the abilities of an AI model he trained. “He likes to sell big. That doesn’t mean we should take him seriously,” Marcus said.
Other critics of Shumer’s blog point out that his economic analysis is ahistorical: every other technological revolution has, in the long run, created more jobs than it eliminated. Connor Boyack, president of the Libertas Institute, a policy think tank in Utah, wrote an entire counter-blog post making this argument. So, yes, AI may be poised to transform work. But the kind of full task automation that some software developers have started to observe for some tasks? For most knowledge workers, especially those embedded in large organizations, that is going to take much longer than Shumer implies.