Chan Zuckerberg Initiative’s rBio uses virtual cells to train AI, bypassing lab work

The Chan Zuckerberg Initiative announced Thursday the launch of rBio, the first artificial intelligence model trained to reason about cellular biology using virtual simulations rather than requiring expensive laboratory experiments — a breakthrough that could dramatically accelerate biomedical research and drug discovery.

The reasoning model, detailed in a research paper published on bioRxiv, demonstrates a novel approach called “soft verification” that uses predictions from virtual cell models as training signals instead of relying solely on experimental data. This paradigm shift could help researchers test biological hypotheses computationally before committing time and resources to costly laboratory work.

“The idea is that you have these super powerful models of cells, and you can use them to simulate outcomes rather than testing them experimentally in the lab,” said Ana-Maria Istrate, senior research scientist at CZI and lead author of the research, in an interview. “The paradigm so far has been that 90% of the work in biology is tested experimentally in a lab, while 10% is computational. With virtual cell models, we want to flip that paradigm.”

How AI finally learned to speak the language of living cells

The announcement represents a significant milestone for CZI’s ambitious goal to “cure, prevent, and manage all disease by the end of this century.” Under the leadership of pediatrician Priscilla Chan and Meta CEO Mark Zuckerberg, the $6 billion philanthropic initiative has increasingly focused its resources on the intersection of artificial intelligence and biology.

rBio addresses a fundamental challenge in applying AI to biological research. While large language models like ChatGPT excel at processing text, biological foundation models typically work with complex molecular data that cannot be easily queried in natural language. Scientists have struggled to bridge this gap between powerful biological models and user-friendly interfaces.

“Foundation models of biology — models like GREmLN and TranscriptFormer — are built on biological data modalities, which means you cannot interact with them in natural language,” Istrate explained. “You have to find complicated ways to prompt them.”

The new model solves this problem by distilling knowledge from CZI’s TranscriptFormer — a virtual cell model trained on 112 million cells from 12 species spanning 1.5 billion years of evolution — into a conversational AI system that researchers can query in plain English.

The ‘soft verification’ revolution: Teaching AI to think in probabilities, not absolutes

The core innovation lies in rBio’s training methodology. Traditional reasoning models learn from questions with unambiguous answers, like mathematical equations. But biological questions involve uncertainty and probabilistic outcomes that don’t fit neatly into binary categories.

CZI’s research team, led by Senior Director of AI Theofanis Karaletsos and Istrate, overcame this challenge by using reinforcement learning with proportional rewards. Instead of simple yes-or-no verification, the model receives rewards proportional to the likelihood that its biological predictions align with reality, as determined by virtual cell simulations.

“We applied new methods to how LLMs are trained,” the research paper explains. “Using an off-the-shelf language model as a scaffold, the team trained rBio with reinforcement learning, a common technique in which the model is rewarded for correct answers. But instead of asking a series of yes/no questions, the researchers tuned the rewards in proportion to the likelihood that the model’s answers were correct.”

This approach allows scientists to ask complex questions like “Would suppressing the actions of gene A result in an increase in activity of gene B?” and receive scientifically grounded responses about cellular changes, including shifts from healthy to diseased states.
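To make the contrast with yes-or-no verification concrete, here is a minimal sketch of a proportional reward. It is not CZI's training code: the function names and the `p_yes` value are hypothetical, and the sketch simply assumes a virtual cell model can supply a probability that the true answer to a question is "yes."

```python
def hard_reward(model_answer: bool, ground_truth: bool) -> float:
    """Binary verification: full reward only when the answer matches a known label."""
    return 1.0 if model_answer == ground_truth else 0.0


def soft_reward(model_answer: bool, p_yes: float) -> float:
    """Soft verification: reward is the probability the verifier gives the model's answer.

    p_yes is an assumed input: the probability, taken from a virtual cell
    model, that the true answer is "yes". No lab label is needed.
    """
    if not 0.0 <= p_yes <= 1.0:
        raise ValueError("p_yes must be between 0 and 1")
    return p_yes if model_answer else 1.0 - p_yes


# Question: "Would suppressing gene A increase the activity of gene B?"
p_yes = 0.75  # hypothetical verifier output, chosen for illustration
reward_for_yes = soft_reward(True, p_yes)
reward_for_no = soft_reward(False, p_yes)
print(reward_for_yes, reward_for_no)
```

Under these assumptions, a "yes" answer earns a larger reward than a "no," but neither is scored as simply right or wrong, which is the property the article describes.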

Beating the benchmarks: How rBio outperformed models trained on real lab data

In testing on PerturbQA, a standard benchmark dataset for evaluating gene perturbation prediction, rBio performed competitively with models trained on experimental data. The system outperformed baseline large language models and matched the performance of specialized biological models on key metrics.

Notably, rBio showed strong “transfer learning” capabilities, successfully applying knowledge about gene co-expression patterns learned from TranscriptFormer to make accurate predictions about gene perturbation effects, a completely different biological task.

“We show that on the PerturbQA dataset, models trained using soft verifiers learn to generalize on out-of-distribution cell lines, potentially bypassing the need to train on cell-line specific experimental data,” the researchers wrote.

When enhanced with chain-of-thought prompting techniques that encourage step-by-step reasoning, rBio achieved state-of-the-art performance, surpassing the previous leading model SUMMER.
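For readers unfamiliar with the technique, chain-of-thought prompting amounts to asking the model to reason before it answers. The sketch below shows the general idea only. The `build_prompt` helper and its wording are hypothetical and are not the prompts used in the rBio paper.

```python
def build_prompt(question: str, chain_of_thought: bool = False) -> str:
    """Build a direct prompt or a step-by-step prompt for a yes/no biology question.

    The template text is illustrative; it is not taken from the rBio paper.
    """
    if chain_of_thought:
        return (
            f"Question: {question}\n"
            "Think step by step about the genes involved, "
            "then give a final answer of yes or no.\n"
            "Reasoning:"
        )
    return f"Question: {question}\nAnswer yes or no.\nAnswer:"


question = "Would suppressing gene A increase the activity of gene B?"
print(build_prompt(question))
print(build_prompt(question, chain_of_thought=True))
```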

From social justice to science: Inside CZI’s controversial pivot to pure research

The rBio announcement comes as CZI has undergone significant organizational changes, refocusing its efforts from a broad philanthropic mission that included social justice and education reform to a more targeted emphasis on scientific research. The shift has drawn criticism from some former employees and grantees who saw the organization abandon progressive causes.

However, for Istrate, who has worked at CZI for six years, the focus on biological AI represents a natural evolution of long-standing priorities. “My experience and work has not changed much. I have been part of the science initiative for as long as I have been at CZI,” she said.

The concentration on virtual cell models builds on nearly a decade of foundational work. CZI has invested heavily in building cell atlases — comprehensive databases showing which genes are active in different cell types across species — and developing the computational infrastructure needed to train large biological models.

“I’m really excited about the work that’s been happening at CZI for years now, because we’ve been building up to this moment,” Istrate noted, referring to the organization’s earlier investments in data platforms and single-cell transcriptomics.

Building bias-free biology: How CZI curated diverse data to train fairer AI models

One critical advantage of CZI’s approach stems from its years of careful data curation. The organization operates CZ CELLxGENE, one of the largest repositories of single-cell biological data, where information undergoes rigorous quality control processes.

“We’ve generated some of the flagship initial data atlases for transcriptomics, and those were generated with diversity in mind to minimize bias in terms of cell types, ancestry, tissues, and donors,” Istrate explained.

This attention to data quality becomes crucial when training AI models that could influence medical decisions. Unlike some commercial AI efforts that rely on publicly available but potentially biased datasets, CZI’s models benefit from carefully curated biological data designed to represent diverse populations and cell types.

Open source vs. big tech: Why CZI is giving away billion-dollar AI technology for free

CZI’s commitment to open-source development distinguishes it from commercial competitors like Google DeepMind and pharmaceutical companies developing proprietary AI tools. All CZI models, including rBio, are freely available through the organization’s Virtual Cell Platform, complete with tutorials that can run on free Google Colab notebooks.

“I do think the open source piece is very important, because that’s a core value that we’ve had since we’ve started CZI,” Istrate said. “One of the main goals for our work is to accelerate science. So everything we do is we want to make it open source for that purpose only.”

This strategy aims to democratize access to sophisticated biological AI tools, potentially benefiting smaller research institutions and startups that lack the resources to develop such models independently. The approach reflects CZI’s philanthropic mission while creating network effects that could accelerate scientific progress.

The end of trial and error: How AI could slash drug discovery from decades to years

The potential applications extend far beyond academic research. By enabling scientists to quickly test hypotheses about gene interactions and cellular responses, rBio could significantly accelerate the early stages of drug discovery, a process that typically takes more than a decade and costs billions of dollars.

The model’s ability to predict how gene perturbations affect cellular behavior could prove particularly valuable for understanding neurodegenerative diseases like Alzheimer’s, where researchers need to identify how specific genetic changes contribute to disease progression.

“Answers to these questions can shape our understanding of the gene interactions contributing to neurodegenerative diseases like Alzheimer’s,” the research paper notes. “Such knowledge could lead to earlier intervention, perhaps halting these diseases altogether someday.”

The universal cell model dream: Integrating every type of biological data into one AI brain

rBio represents the first step in CZI’s broader vision to create “universal virtual cell models” that integrate knowledge from multiple biological domains. Currently, researchers must work with separate models for different types of biological data—transcriptomics, proteomics, imaging—without easy ways to combine insights.

“One of the grand challenges in building these virtual cell models and understanding cells, as I mentioned, over the next couple of years, is how to integrate knowledge from all of these super powerful models of biology,” Istrate said. “The main challenge is, how do you integrate all of this knowledge into one space?”

The researchers demonstrated this integration capability by training rBio models that combine multiple verification sources — TranscriptFormer for gene expression data, specialized neural networks for perturbation prediction, and knowledge databases like Gene Ontology. These combined models significantly outperformed single-source approaches.
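One simple way to picture combining verification sources is a weighted average of each source's agreement with the model's answer. The sketch below assumes exactly that. The aggregation rule, the verifier names, and the probabilities are all hypothetical, and the paper may combine its sources differently.

```python
from typing import Dict


def combined_soft_reward(
    model_answer: bool,
    verifier_p_yes: Dict[str, float],
    weights: Dict[str, float],
) -> float:
    """Weighted average of per-verifier soft rewards.

    verifier_p_yes maps a verifier name to its assumed probability that the
    true answer is "yes"; weights maps the same names to non-negative weights.
    Averaging is an assumption made for illustration.
    """
    total_weight = sum(weights[name] for name in verifier_p_yes)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    score = 0.0
    for name, p_yes in verifier_p_yes.items():
        agreement = p_yes if model_answer else 1.0 - p_yes
        score += weights[name] * agreement
    return score / total_weight


# Hypothetical outputs from three kinds of sources named in the article.
p_yes_by_source = {
    "transcriptformer": 0.75,
    "perturbation_network": 0.5,
    "gene_ontology": 1.0,
}
equal_weights = {name: 1.0 for name in p_yes_by_source}
print(combined_soft_reward(True, p_yes_by_source, equal_weights))
```

A design like this lets a confident source pull the reward up or down while an uncertain one (here, a probability of 0.5) contributes little information either way.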

The roadblocks ahead: What could stop AI from revolutionizing biology

Despite its promising performance, rBio faces several technical challenges. The model’s current expertise focuses primarily on gene perturbation prediction, though the researchers indicate that any biological domain covered by TranscriptFormer could theoretically be incorporated.

The team continues working on improving the user experience and implementing appropriate guardrails to prevent the model from providing answers outside its area of expertise—a common challenge in deploying large language models for specialized domains.

“While rBio is ready for research, the model’s engineering team is continuing to improve the user experience, because the flexible problem-solving that makes reasoning models conversational also poses a number of challenges,” the research paper explains.

The trillion-dollar question: How open source biology AI could reshape the pharmaceutical industry

The development of rBio occurs against the backdrop of intensifying competition in AI-driven drug discovery. Major pharmaceutical companies and technology firms are investing billions in biological AI capabilities, recognizing the potential to transform how medicines are discovered and developed.

CZI’s open-source approach could accelerate this transformation by making sophisticated tools available to the broader research community. Academic researchers, biotech startups, and even established pharmaceutical companies can now access capabilities that would otherwise require substantial internal AI development efforts.

The timing proves significant as the Trump administration has proposed substantial cuts to the National Institutes of Health budget, potentially threatening public funding for biomedical research. CZI’s continued investment in biological AI infrastructure could help maintain research momentum during periods of reduced government support.

A new chapter in the race against disease

rBio’s launch marks more than just another AI breakthrough—it represents a fundamental shift in how biological research could be conducted. By demonstrating that virtual simulations can train models as effectively as expensive laboratory experiments, CZI has opened a path for researchers worldwide to accelerate their work without the traditional constraints of time, money, and physical resources.

With rBio freely available through its Virtual Cell Platform, the organization continues expanding its biological AI capabilities with models like GREmLN for cancer detection and ongoing work on imaging technologies. The success of the soft verification approach could influence how other organizations train AI for scientific applications, potentially reducing dependence on experimental data while maintaining scientific rigor.

For an organization that began with the audacious goal of curing all diseases by the century’s end, rBio offers something that has long eluded medical researchers: a way to ask biology’s hardest questions and get scientifically grounded answers in the time it takes to type a sentence. In a field where progress has traditionally been measured in decades, that kind of speed could make all the difference between diseases that define generations—and diseases that become distant memories.

The post Chan Zuckerberg Initiative’s rBio uses virtual cells to train AI, bypassing lab work appeared first on Venture Beat.