Last month, along with a comprehensive suite of new AI tools and innovations, Google DeepMind unveiled Gemini Diffusion. This experimental research model uses a diffusion-based approach to generate text. Traditionally, large language models (LLMs) like GPT and Gemini itself have relied on autoregression, a step-by-step approach where each token is generated based on those that came before it. Diffusion language models (DLMs), also known as diffusion-based large language models (dLLMs), leverage a method more commonly seen in image generation: starting with random noise and gradually refining it into a coherent output. This approach dramatically increases generation speed and can improve coherence and consistency.
Gemini Diffusion is currently available as an experimental demo; sign up for the waitlist to get access.
Understanding diffusion vs. autoregression
Diffusion and autoregression are fundamentally different approaches. The autoregressive approach generates text sequentially, with tokens predicted one at a time. While this method ensures strong coherence and context tracking, it can be computationally intensive and slow, especially for long-form content.
Diffusion models, by contrast, begin with random noise, which is gradually denoised into a coherent output. When applied to language, the technique has several advantages. Blocks of text can be processed in parallel, potentially producing entire segments or sentences at a much higher rate.
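To make the contrast concrete, here is a minimal sketch of the two decoding loops in Python. It is illustrative only: `predict_next` and `denoise_all` are hypothetical stand-ins for a trained model's forward pass, not Gemini Diffusion's actual API.

```python
# Conceptual sketch only; `predict_next` and `denoise_all` are hypothetical
# stand-ins for a trained model's forward pass.

def generate_autoregressive(predict_next, prompt_tokens, max_new_tokens):
    """Sequential decoding: one forward pass per new token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # Each step depends on every token generated so far.
        tokens.append(predict_next(tokens))
    return tokens

def generate_diffusion(denoise_all, prompt_tokens, seq_len, num_steps):
    """Parallel decoding: every position in the block is refined at each step."""
    MASK = -1                      # placeholder "noise" token
    block = [MASK] * seq_len       # start from pure noise
    for step in reversed(range(num_steps)):
        # One forward pass updates *all* positions simultaneously.
        block = denoise_all(prompt_tokens, block, step)
    return block
```

The key difference is that the autoregressive loop runs one forward pass per token, while the diffusion loop runs a fixed number of passes regardless of sequence length, which is where the speed advantage comes from.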
Gemini Diffusion can reportedly generate 1,000-2,000 tokens per second. In contrast, Gemini 2.5 Flash has an average output speed of 272.4 tokens per second. Additionally, mistakes in generation can be corrected during the refining process, improving accuracy and reducing the number of hallucinations. There may be trade-offs in terms of fine-grained accuracy and token-level control; however, the increase in speed will be a game-changer for numerous applications.
How does diffusion-based text generation work?
During training, DLMs work by gradually corrupting a sentence with noise over many steps, until the original sentence is rendered entirely unrecognizable. The model is then trained to reverse this process, step by step, reconstructing the original sentence from increasingly noisy versions. Through this iterative refinement, the model learns to capture the entire distribution of plausible sentences in the training data.
While the specifics of Gemini Diffusion have not yet been disclosed, the typical training methodology for a diffusion model involves these key stages:
Forward diffusion: For each sample in the training dataset, noise is added progressively over multiple cycles (often 500 to 1,000) until the original text becomes indistinguishable from random noise.
Reverse diffusion: The model learns to reverse each step of the noising process, essentially learning how to “denoise” a corrupted sentence one stage at a time, eventually restoring the original structure.
This process is repeated millions of times with diverse samples and noise levels, enabling the model to learn a reliable denoising function.
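Google has not published Gemini Diffusion's training details, but a common recipe for text diffusion uses masking as the noise process: corrupted tokens are replaced with a reserved mask token, and the model learns to restore them. The PyTorch sketch below illustrates one training step under that assumption; `model` (a bidirectional transformer mapping corrupted tokens and a noise level to vocabulary logits) and `mask_id` are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, tokens, num_levels=1000, mask_id=0):
    """One training step: corrupt a batch of sentences, learn to undo it."""
    batch, seq_len = tokens.shape

    # Forward diffusion: draw a random noise level per sample and mask that
    # fraction of tokens, so training covers every corruption stage.
    t = torch.randint(1, num_levels + 1, (batch, 1))   # noise level per sample
    mask_prob = t.float() / num_levels                 # fraction to corrupt
    corrupted = torch.where(
        torch.rand(batch, seq_len) < mask_prob,
        torch.full_like(tokens, mask_id),
        tokens,
    )

    # Reverse diffusion objective: predict the original tokens at the
    # corrupted positions, one denoising stage at a time.
    logits = model(corrupted, t)                       # (batch, seq, vocab)
    masked = corrupted.eq(mask_id) & tokens.ne(mask_id)
    loss = F.cross_entropy(logits[masked], tokens[masked])
    return loss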
Once trained, the model is capable of generating entirely new sentences. DLMs generally require a condition or input, such as a prompt, class label, or embedding, to guide the generation towards desired outcomes. The condition is injected into each step of the denoising process, which shapes an initial blob of noise into structured and coherent text.
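Under the same masking assumption, the conditional denoising loop at inference time can look like the following sketch: the output block starts fully masked, the prompt is injected into every step, and at each step the model's most confident predictions are committed while the rest remain noisy. The `model(prompt, block, t)` signature is an assumption for illustration, not Gemini Diffusion's interface.

```python
import torch

def diffusion_sample(model, prompt, seq_len, num_steps, mask_id=0):
    """Iteratively denoise a fully masked block, conditioned on `prompt`."""
    block = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in reversed(range(num_steps)):
        still_masked = block.eq(mask_id)
        if not still_masked.any():
            break                                      # nothing left to denoise
        t = torch.tensor([[step + 1]])                 # current noise level
        logits = model(prompt, block, t)               # condition on the prompt
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        # Commit only the most confident still-masked positions this step.
        conf = conf.masked_fill(~still_masked, -1.0)
        k = max(1, int(still_masked.sum().item() // (step + 1)))
        commit = conf.topk(k, dim=-1).indices
        block[0, commit[0]] = pred[0, commit[0]]
    return block
```

Because positions committed early can still be revisited by schedules that re-mask low-confidence tokens, this loop is also where the self-correction behavior described below comes from.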
Advantages and disadvantages of diffusion-based models
In an interview with VentureBeat, Brendan O’Donoghue, research scientist at Google DeepMind and one of the leads on the Gemini Diffusion project, elaborated on some of the advantages of diffusion-based techniques when compared to autoregression. According to O’Donoghue, the major advantages of diffusion techniques are the following:
- Lower latencies: Diffusion models can produce a sequence of tokens in much less time than autoregressive models.
- Adaptive computation: Diffusion models will converge to a sequence of tokens at different rates depending on the task’s difficulty. This allows the model to consume fewer resources (and have lower latencies) on easy tasks and more on harder ones.
- Non-causal reasoning: Due to the bidirectional attention in the denoiser, tokens can attend to future tokens within the same generation block. This enables non-causal reasoning and lets the model make global edits within a block to produce more coherent text (see the attention-mask sketch after this list).
- Iterative refinement / self-correction: The denoising process involves sampling, which can introduce errors just like in autoregressive models. However, unlike autoregressive models, the tokens are passed back into the denoiser, which then has an opportunity to correct the error.
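The non-causal reasoning point comes down to the attention mask. A quick PyTorch comparison, a sketch rather than either model's actual implementation, shows the difference: a causal mask hides future tokens from each position, while the denoiser's bidirectional mask lets every token in a block see every other.

```python
import torch

seq_len = 6

# Autoregressive (causal) mask: position i may only attend to positions <= i,
# so a token can never "see" the tokens that come after it.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Diffusion denoiser (bidirectional) mask: every position attends to every
# other position in the block, enabling global, non-causal edits.
bidirectional = torch.ones(seq_len, seq_len, dtype=torch.bool)

print(causal.int())
print(bidirectional.int())
```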
O’Donoghue also noted the main disadvantages: “higher cost of serving and slightly higher time-to-first-token (TTFT), since autoregressive models will produce the first token right away. For diffusion, the first token can only appear when the entire sequence of tokens is ready.”
Performance benchmarks
Google says Gemini Diffusion’s performance is comparable to Gemini 2.0 Flash-Lite.
[Benchmark table comparing Gemini Diffusion and Gemini 2.0 Flash-Lite. Footnote: non-agentic evaluation (single-turn edit only), maximum prompt length of 32K.]
The two models were compared using several benchmarks, with scores based on how often the model produced the correct answer on the first try. Gemini Diffusion performed well in coding and mathematics tests, while Gemini 2.0 Flash-Lite had the edge on reasoning, scientific knowledge, and multilingual capabilities.
As Gemini Diffusion evolves, there’s no reason to think that its performance won’t catch up with more established models. According to O’Donoghue, the gap between the two techniques is “essentially closed in terms of benchmark performance, at least at the relatively small sizes we have scaled up to. In fact, there may be some performance advantage for diffusion in some domains where non-local consistency is important, for example, coding and reasoning.”
Testing Gemini Diffusion
VentureBeat was granted access to the experimental demo. When putting Gemini Diffusion through its paces, the first thing we noticed was the speed. When running the suggested prompts provided by Google, including building interactive HTML apps like Xylophone and Planet Tac Toe, each request completed in under three seconds, with speeds ranging from 600 to 1,300 tokens per second.
To test its performance with a real-world application, we asked Gemini Diffusion to build a video chat interface with the following prompt:
Build an interface for a video chat application. It should have a preview window that accesses the camera on my device and displays its output. The interface should also have a sound level meter that measures the output from the device's microphone in real time.
In less than two seconds, Gemini Diffusion created a working interface with a video preview and an audio meter.
Though this was not a complex implementation, it could be the start of an MVP that can be completed with a bit of further prompting. Note that Gemini 2.5 Flash also produced a working interface, albeit more slowly (approximately seven seconds).
Gemini Diffusion also features “Instant Edit,” a mode where text or code can be pasted in and edited in real time with minimal prompting. Instant Edit is effective for many types of text editing, including correcting grammar, updating text to target different reader personas, or adding SEO keywords. It is also useful for tasks such as refactoring code, adding new features to applications, or converting an existing codebase to a different language.
Enterprise use cases for DLMs
It’s safe to say that any application that requires a quick response time stands to benefit from DLM technology. This includes real-time and low-latency applications, such as conversational AI and chatbots, live transcription and translation, or IDE autocomplete and coding assistants.
According to O’Donoghue, with applications that leverage “inline editing, for example, taking a piece of text and making some changes in-place, diffusion models are applicable in ways autoregressive models aren’t.” DLMs also have an advantage with reasoning, math, and coding problems, due to “the non-causal reasoning afforded by the bidirectional attention.”
DLMs are still in their infancy; however, the technology can potentially transform how language models are built. Not only do they generate text at a much higher rate than autoregressive models, but their ability to go back and fix mistakes means that, eventually, they may also produce results with greater accuracy.
Gemini Diffusion enters a growing ecosystem of DLMs, with two notable examples being Mercury, developed by Inception Labs, and LLaDA, an open-source model from GSAI. Together, these models reflect the broader momentum behind diffusion-based language generation and offer a scalable, parallelizable alternative to traditional autoregressive architectures.