The tasks resemble the ones lawyers, doctors, financial analysts, and management consultants tackle for a living. One asks for a diagnosis of a six-year-old patient based on nine pieces of multimedia evidence; another asks for legal advice on a musician’s estate; a third calls for a valuation of part of a healthcare technology company.
Mercor, which claims to supply “expert data” to every top AI company, says that it spent more than $500,000 to develop 200 tasks that test whether AIs “can perform knowledge work with high economic value” across law, medicine, finance, and management consulting. The resulting AI Productivity Index (APEX), published Wednesday, lists among its co-authors a former global managing director of McKinsey, a former dean of Harvard Business School, and a Harvard Law School professor, who advised on the design and scope of the tasks in their respective domains, according to Mercor. APEX is “focused on going very deep,” says Brendan Foody, the company’s 22-year-old CEO. “How do we get very comprehensive about what it means to be a consultant or a banker or a doctor or lawyer?”
To create the tasks, Mercor contracted white-collar professionals whose former employers include top banks (Goldman Sachs, JPMorgan), consulting firms (McKinsey, Boston Consulting Group), law firms (Latham & Watkins) and hospitals (Mount Sinai). They average 7.25 years of professional experience, and their pay at Mercor is competitive with what they earned at those highly prestigious employers. Mercor’s website advertises an average rate of $81 per hour, rising to more than $200 per hour—equivalent to an annual salary of about $400,000—for “Senior Domain Experts,” who must have at least four years of professional experience to apply.
“It’s hard to imagine a better hourly job from a pay perspective,” says Matt Seck, a former investment banking analyst at Bank of America, who is contracted by Mercor to write finance tasks similar to those included in the paper.
Benchmarks have long been used to assess AI capability, but directly quantifying AI models’ ability to do economically useful work represents a “paradigm shift,” says Osvald Nitski, one of the paper’s authors. On Mercor’s benchmark, “getting 100% would mean that you’d basically have an analyst or an associate in a box that you could go and send tasks to, and then they deliver it to the requirements of a partner, or an MD, or whoever would be grading the work of that person,” says Nitski.
The models aren’t there yet, but they are improving fast. OpenAI’s GPT-4o, released in May 2024, scored 35.9% on the benchmark. GPT-5, released just over a year later, achieved 64.2%, the top score so far. That doesn’t mean GPT-5 delivers 64.2% of the value of a human worker; work that doesn’t hit 100% “might be effectively useless,” the paper’s authors write. GPT-5 earned full marks on only two of the 200 tasks—one in law and one in investment banking—which “primarily involve basic reasoning, simple calculations, and a lot of basic information searching,” according to Mercor.
Even a model that hit 100% on Mercor’s benchmark would probably make a poor substitute for human professionals. The tasks focus on “well scoped deliverables,” such as making diagnoses or building financial models, rather than more open-ended work that might admit multiple right answers. That means the task descriptions must spell out numerous assumptions to ensure the desired output is well specified. The AIs’ outputs are entirely text-based, so the benchmark doesn’t test their ability to use a computer the way a human worker would. (Mercor says future versions of APEX will address these limitations.) And drafting the lengthy prompts the models need to complete the tasks “would be more tedious than just doing it yourself,” says Seck.
Still, there are signs that AI models are becoming competitive with humans. Another benchmark, published Thursday, Sept. 25, by OpenAI, found that expert human evaluators preferred an AI’s work to a human’s 47.6% of the time across 220 tasks, including designing a sales brochure for a property and assessing images of a skin lesion. OpenAI also found that its models’ performance has risen substantially in a short space of time, with their “win rate” against humans more than doubling between June 2024 and September 2025.
As models’ capabilities have grown, so have the complexity of the tasks they’re tested on and the human skill needed to create sufficiently challenging ones. Earlier tests measured relatively abstract capabilities through reasoning puzzles and exam-style questions. Benchmarks built before the 2022 release of ChatGPT often sourced data from crowdworker services that paid workers a few dollars an hour. By 2023, Ph.D. students were being asked to create challenging multiple-choice questions in biology, physics, and chemistry. In September, xAI reportedly laid off 500 of its “generalist” data workers as part of an “expansion and prioritization” of the company’s “specialist” data workers. To be sure, low-paid data workers still contribute to the development of AI models, but the upper bound of the skill and compensation needed to build AI benchmarks is rising fast.
Directly measuring the utility of AI models on economically valuable tasks is “very hard to pull off,” says Nitski. Success criteria in domains such as finance and consulting are harder to define than in, say, software engineering. And even with perfect criteria in hand, grading an AI’s output at scale is harder than in software engineering, where automated tests can check whether a piece of code runs correctly. That explains, in part, why tests of the real-world utility of AI models have existed for software engineering since at least 2023 but have lagged in other white-collar domains. As AIs have improved, however, they have helped solve the problem of grading complex tasks. The success criteria for Mercor’s tasks are written by human experts, but the grading is done by AIs, which Mercor says agreed with human graders 89% of the time, making the evaluations scalable.
Developing benchmarks isn’t just about knowing how good models are. In AI, as in business, “what gets measured gets done”—good tests often precipitate AI progress on those tests. “It’s ultimately the same data type for both evaluation and training,” says Foody. Evaluating performance in games such as Go is straightforward; AI was beating Go masters by 2016. In 2023, benchmarks began evaluating AIs on real-world tasks in software engineering. Two years later, the job numbers for junior programmers look grim.
“AI got its Ph.D.,” says Foody. “Now it’s starting to enter the job market.”