
Can Science Predict When a Study Won’t Hold Up?

April 1, 2026

Scientists publish more than 10 million studies and other publications a year. Some of those findings will add to humanity’s storehouse of knowledge. But some will be wrong.

To assess a study, scientists can replicate it to see if they get the same result. But seven years ago, a team of hundreds of scientists set out to find a faster way to judge new scientific literature. They built artificial intelligence systems to predict whether studies would hold up to scrutiny.

The project, funded by the Defense Advanced Research Projects Agency, or DARPA, was called Systematizing Confidence in Open Research and Evidence — SCORE, for short. The idea came from Adam Russell, then a program manager for the agency. He envisioned generating a kind of credit score for science.

“People can say, ‘Hey, this is likely to be robust, we can premise a policy on it,’” said Dr. Russell, who is now at the University of Southern California. “‘But this? Nah, this might make for a book in the airport.’”

The SCORE team inspected hundreds of studies, running many of them again, to better understand what makes research hold up. Now it is publishing a raft of papers on those efforts.

For now, a scientific credit score remains a dream, the researchers say. Artificial intelligence cannot make reliable predictions.

“We’re not there yet,” said Brian Nosek, the executive director of the Center for Open Science and a leader of the project. “It’s picking up some kind of signal, but it would have to get a lot more accurate to use on its own.”

But along the way, outside experts said, the SCORE team has made a remarkably deep dive into the scientific process, uncovering clues that could help improve it.

“I don’t think there’s ever been anything on this scale before,” said Dorothy Bishop, a psychologist at the University of Oxford who was not part of the effort.

See for yourself

Replicating research has been a mainstay of science for generations. In 1953, scientists were startled when Clair Patterson, a geochemist at Caltech, used a new technique to determine that Earth is 4.5 billion years old — 1.2 billion years older than previous estimates.

“I had some of the best, most able critics in the world trying to destroy my number,” Dr. Patterson later recalled. “They worked their hearts out to prove I was wrong.” Try as they might, however, his number stuck.

But sometimes replications do not agree. In 1976, archaeologists discovered an ancient hunting camp in Monte Verde, Chile, and determined that it was about 14,500 years old, much older than previously discovered evidence of people in the Americas.

Almost 50 years passed before an independent team of scientists replicated the study. Last month, they reached a very different conclusion: People lived at Monte Verde sometime from 4,200 to 8,200 years ago.

The authors of the original study dispute the new finding; more research will probably be needed to resolve the conflict. That’s how science corrects itself.

At least, that’s how it’s supposed to work. But replicating previous research takes time and money that researchers might prefer to spend on their own studies. And journal editors often yawn at replication.

Melanie Mitchell, an artificial intelligence researcher at the Santa Fe Institute in New Mexico, recently replicated an A.I. paper and failed to match the original results. A journal rejected her paper on the grounds that it lacked novelty.

“I really hate this kind of culture,” Dr. Mitchell told a lecture audience at Yale last month.

Solving a ‘wicked problem’

For more than 15 years, some scientists have been trying to change the culture. They started by documenting the extent of the problem. In the early 2010s, Dr. Nosek and colleagues replicated 100 psychology papers — and matched the original results only 39 percent of the time.

In another project, Dr. Nosek teamed up with cancer biologists to replicate 50 experiments on animals and human cells. Fewer than half of the results withstood their scrutiny.

Dr. Russell, at DARPA, wondered if scientists could use artificial intelligence to predict the trustworthiness of a study. But first scientists would have to gather much more data on replication. “I knew this was a wicked problem,” he said.

The SCORE project started in 2019 and grew to include 865 researchers. They analyzed 3,900 papers published from 2009 to 2018 from fields across the social sciences, such as criminology, economics, psychology and sociology.

In one line of research, the SCORE team replicated 164 of the studies. Team members reran some experiments, recruiting volunteers to take the original tests again. For studies based on government statistics, SCORE team members obtained the data themselves and reanalyzed it.

Only about half of the replicated studies yielded the same results as the originals.

Tim Parker, a biologist at Whitman College who was not involved in the research, said that the low rate was in line with earlier, smaller studies.

“I think they’re very convincing results,” he said. “And I would hope that people who were not persuaded by previous empirical evidence would be more persuaded by this.”

Dr. Parker and other researchers have raised concerns about how scientists use different methods to study the same data. Even if the methods are all legitimate, they might lead to conflicting results, they argue.

The SCORE team measured how robust research findings were when scientists used different methods. Members picked 100 papers and assigned at least five teams of experts to each one. Each team applied its own methods to analyze the original data.

“A lot of times, those choices are consequential,” Dr. Nosek said. In only about 57 percent of the SCORE trials did all five teams get roughly the same result as the original study. They obtained precisely the same result only one-third of the time.

The SCORE team also considered how problematic data, as well as problems in the computer programs used for analysis, can lead to replication failures.

The researchers reanalyzed the data in 143 papers using the same code as the original authors. About 9 percent of the SCORE results were completely different from the original ones; another 14 percent were only roughly the same.

Abel Brodeur, an economist at the University of Ottawa, said that he had encountered similar problems in his own science-testing project, the Institute for Replication. These failures can arise when scientists make mistakes while formatting their data or writing their programs. “Sometimes the coding errors are crazy,” he said.

The trouble may actually be worse than the SCORE study suggests, because scientists often fail to share their data and code. When the SCORE team had to write its own code to analyze data, it reproduced exactly the same results less than half of the time.

Dr. Russell had hoped that artificial intelligence systems could train on findings from SCORE to learn the telltale signs of a paper that will or won’t replicate successfully. But the mystery of replication still seems too deep; the predictions from A.I. aren’t entirely random, but they’re far from perfect, the SCORE team found.

“It’s still not that impressive,” said Andrew Tyner, a principal research scientist at the Center for Open Science and an author of the new studies. “But there might be some there there.”

That doesn’t mean experts can trust their own instincts, however. The SCORE project enlisted hundreds of experts to predict whether papers would replicate successfully. Reviewing 132 replications, the SCORE team found that the experts guessed correctly about three-quarters of the time.

For Dr. Nosek, the chief value of SCORE has been to demonstrate how complex the scientific process is, and to highlight ways to improve it.

For instance, scientists can announce the plan for an experiment in advance, which discourages them from tweaking their hypothesis to fit the data they ultimately get.

Dr. Brodeur said that journals can help by requiring authors to share original data and code. “People have cleaned up their mess,” he said.

Dr. Jay Bhattacharya, the director of the National Institutes of Health, said in an interview that the agency was working on ways to improve replication.

“Science determines what’s true based on replication,” he said. “I don’t feel it’s working well right now.”

Starting this year, the agency plans to share new tools for sharing data and code. It will also identify key ideas in different fields and award grants to replicate them. And the agency is developing a journal that Dr. Bhattacharya described as “a place where you can publish your replication efforts and get credit for it.”

Jeremy Berg, a biochemist at the University of Pittsburgh School of Medicine, as well as former director of the National Institute of General Medical Sciences and a critic of Dr. Bhattacharya, characterized his plans as “painfully naïve.”

Dr. Berg warned that projects like centralized data platforms and replications of key ideas would work only if the government made expensive, long-term commitments to them. Simply offering more opportunities to publish replication studies won’t, on its own, make universities value them when deciding about hiring and tenure.

“I don’t think anybody’s cracked the code on that,” he said.

Dr. Nosek cautioned that no matter how much care researchers put into their work, they will still sometimes turn out to be wrong.

“It’s hard at the frontier of knowledge, and it doesn’t matter what questions you’re working on,” Dr. Nosek said. “There are lots of false starts, and lots of things that don’t make sense.”

Carl Zimmer covers news about science for The Times and writes the Origins column.
