Inside Fei-Fei Li’s Plan to Build AI-Powered Virtual Worlds

Recent AI progress has followed a pattern. Across text, image, audio, and video, once the right technical foundations were discovered, it only took a few years for AI-generated outputs to go from merely passable to indistinguishable from human creation. Although it’s early, recent advances suggest that virtual worlds—3D environments you can explore and interact with—could be next.

This is the bet being made by pioneering AI researcher Fei-Fei Li, often called AI’s “godmother” for her contributions to computer vision. In November, her new startup, World Labs, launched its first commercial offering: a platform called Marble, where users can conjure exportable 3D environments from text, image, or video prompts.


The platform could prove immediately useful for design professionals, allowing some technically complex creative work to be automated. But Li’s end goal is much more ambitious: to create not just virtual worlds but what she calls “spatial intelligence,” or, per her recent manifesto, “the frontier beyond language—the capability that links imagination, perception and action.” AI systems can already see the world—with spatial intelligence, she argues, they could begin to meaningfully interact with it.

Worlds on demand

While virtual worlds already exist in the form of video games we engage with through screens or headsets, creating them is technically complex and labor-intensive. With AI, virtual worlds could be created much more easily, personalized to their users, and made to expand infinitely—at least in theory.

In practice, world models—including those from other companies, like Google DeepMind’s Genie 3—are still early relative to their potential. Ben Mildenhall, one of Li’s co-founders at World Labs, says he expects them to follow the same trajectory we’ve seen with text, audio, and video—people moving from “that’s cute” to “that’s interesting” to “I didn’t realize that was made by AI.”

Indeed, AI video generation models have rapidly improved. This improvement is behind the recent viral success of models from OpenAI and Midjourney. Companies like Captions, Runway, and Synthesia have all built businesses around AI-generated video as well. According to Vincent Sitzmann, an assistant professor at MIT and expert on AI world modeling, we can think of video models as “proto-world models.”

Li’s latest platform, Marble, offers various ways to create. You can prompt it with a written description, or with photos, videos, or an existing 3D scene, and it’ll spit out a “world” you can navigate from a first-person perspective, as in a video game. But these worlds—static at first, although developers can add motion and more using specialized tools—have clear limits. It only takes a few beats of exploration before visuals begin to distort and the world assumes a hallucinatory, incoherent structure.

Modeling entire worlds is much harder than generating videos. Mildenhall argues that because there’s a much higher barrier to entry for creating 3D worlds than for writing words, you start to see “glimmers of value” from tools like Marble much earlier. “World Labs has shown what’s possible if you integrate and scale a bunch of the breakthroughs the computer vision community has had over the last decade—it’s a very impressive achievement,” says Sitzmann. “For the first time, you get a glimpse of what kinds of products might be possible with this.”

Li says that “we can use this technology to create many virtual worlds that connect, extend, or complement our physical world.” The case for using world models to create new entertainment experiences is clear enough. And in domains like architecture and engineering, “you can try a thousand times, exploring many potential alternatives at a much lower cost,” says Mildenhall. But for their other touted use cases—robotics, science, and education—major hurdles remain.

A way to go

While we have a plethora of video and camera data with which to train video models, the right training data for robots—particularly humanoid robots—is much harder to come by. We lack proprioceptive or “action data,” says Sitzmann, which would tell a robot which motor movements correspond to physical actions.

For self-driving cars, which have only a few inputs—gears, pedals, and a steering wheel—we can “collect millions of hours of video which is matched with the actions that human drivers took. But a humanoid robot has all these other joints and actions that they can take. And we don’t have data for that,” he says.

In her manifesto, Li argues that world models will play a “defining role” in solving the data problem for robotics. While the manifesto lays out a vision, Sitzmann says it’s “not really answering the question” of how exactly world models will solve robotics in the future, since a faithful simulator would require data that correlates movement to action, which we currently lack.

There are also challenges when it comes to using world models for science and education. For entertainment, it’s sufficient if things look realistic. But for science and education, faithfulness to the real-world dynamics being simulated is arguably more important. “I [could] walk in and experience the inside of a cell,” or “if I’m a surgeon being trained to do laparoscopic surgery, I [could be] inside an intestine,” says Li, discussing what future world models could offer. But of course, a simulation of a cell or a surgery is only useful to the extent that it is accurate. World Labs’ founders are acutely aware of the trade-offs between realism and faithfulness, and are optimistic that at some point, models will be good enough to provide both.

What if it works?

Compared with language, “spatial reasoning is way worse in today’s AI,” says Li. True enough. But while Li is betting that solving spatial intelligence (as her company defines it) is necessary for AI to advance beyond a certain point, a trillion-dollar concern, whether that holds remains to be seen. Whether existing multimodal language models like ChatGPT will “hit a wall” and suddenly stop improving is also an open question. What we do know is that, across the industry and across modalities, the models are improving.

Mildenhall imagines we’ll get to a point where “you can experience anything you can experience in reality within a model.” In such a world, you could “multimodally engage with the thing and transform it at your will to any impulse you have,” he says.

With reasoning models and virtual reality improving in parallel, one can imagine a strange future, where we each have access to our own infinitely expansive and engaging generative worlds. Instead of watching a TikTok video of a cat, a cat is right in front of you. Instead of scrolling, exploring. Such a world would bend to your will. Some users might fall in love with it, as they fall in love with chatbots today. “We’re currently not at that level,” says Christoph Lassner, another World Labs co-founder. Sitzmann agrees that the idea is “not crazy,” although he notes that prohibitive costs and extensive rendering times suggest such a future is still relatively far away.

Li is emphatic that this technology will augment and benefit humans, and that our relationship to it will remain collaborative. Why? “Because I believe in humanity,” she says. “If you look at the arc of history, civilization progresses, and our knowledge increases.” She rejects both utopian and dystopian visions. “I think all of us have a responsibility in ushering AI to a better state as it becomes more powerful,” she says. “All of us should want humanity to prevail and thrive. So where your hope lies should be where your actions go.”

The post Inside Fei-Fei Li’s Plan to Build AI-Powered Virtual Worlds appeared first on TIME.