Researchers from UCLA and Meta AI have introduced d1, a novel framework using reinforcement learning (RL) to significantly enhance the reasoning capabilities of diffusion-based large language models (dLLMs). While most attention has focused on autoregressive models like GPT, dLLMs offer unique advantages. Giving them strong reasoning skills could unlock new efficiencies and applications for enterprises.
dLLMs take a distinctly different approach to generating text than standard autoregressive models, one that can offer advantages in inference efficiency and in how the model processes context, both of which could prove valuable in real-world applications.
Understanding diffusion language models
Most large language models (LLMs) like GPT-4o and Llama are autoregressive (AR). They generate text sequentially, predicting the next token based only on the tokens that came before it.
Diffusion language models (dLLMs) work differently. Diffusion models were initially used in image generation models like DALL-E 2, Midjourney and Stable Diffusion. The core idea involves gradually adding noise to an image until it’s pure static, and then training a model to meticulously reverse this process, starting from noise and progressively refining it into a coherent picture.
Adapting this concept directly to language was tricky because text is made of discrete units (tokens), unlike the continuous pixel values in images. Researchers overcame this by developing masked diffusion language models. Instead of adding continuous noise, these models work by randomly masking out tokens in a sequence and training the model to predict the original tokens.
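As a rough illustration of that masked-prediction objective, here is a minimal training-step sketch in Python/PyTorch; the model interface, the `MASK_ID` value and the per-sequence masking ratio are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical id of the special [MASK] token


def masked_diffusion_loss(model, token_ids):
    """One training step of masked-token prediction (illustrative sketch).

    token_ids: LongTensor of shape (batch, seq_len) holding clean text.
    """
    batch, seq_len = token_ids.shape

    # Sample a masking ratio per sequence -- the "noise level" of this step.
    mask_ratio = torch.rand(batch, 1)
    mask = torch.rand(batch, seq_len) < mask_ratio
    corrupted = token_ids.masked_fill(mask, MASK_ID)

    # Unlike an autoregressive model, the network sees the whole partially
    # masked sequence at once and predicts every position in parallel.
    logits = model(corrupted)  # (batch, seq_len, vocab_size)

    # Train only on the masked positions: reconstruct the original tokens.
    return F.cross_entropy(logits[mask], token_ids[mask])
```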
This leads to a different generation process compared to autoregressive models. dLLMs start with a heavily masked version of the input text and gradually “unmask” or refine it over several steps until the final, coherent output emerges. This “coarse-to-fine” generation enables dLLMs to consider the entire context simultaneously at each step, as opposed to focusing solely on the next token.
This difference gives dLLMs potential advantages, such as improved parallel processing during generation, which could lead to faster inference, especially for longer sequences. Examples of this model type include the open-source LLaDA and the closed-source Mercury model from Inception Labs.
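A minimal sketch of that coarse-to-fine unmasking loop, assuming the same kind of masked-prediction model as above; the confidence-based schedule for choosing which positions to unmask at each step is a common simplification, not necessarily the exact procedure used by LLaDA or Mercury:

```python
import torch


@torch.no_grad()
def generate(model, prompt_ids, gen_len=128, num_steps=16, mask_id=0):
    """Iteratively unmask a fully masked completion (illustrative sketch)."""
    batch = prompt_ids.shape[0]

    # Start with the prompt followed by a fully masked completion.
    completion = torch.full((batch, gen_len), mask_id,
                            dtype=torch.long, device=prompt_ids.device)
    seq = torch.cat([prompt_ids, completion], dim=1)

    for step in range(num_steps):
        logits = model(seq)                              # (batch, total_len, vocab)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)  # best guess per position

        still_masked = seq == mask_id
        remaining = int(still_masked.sum(dim=1)[0])      # same count in every row here
        if remaining == 0:
            break

        # Unmask an even share of the remaining masked positions each step,
        # keeping the positions the model is most confident about.
        num_to_unmask = max(1, remaining // (num_steps - step))
        conf = conf.masked_fill(~still_masked, float("-inf"))
        top_pos = conf.topk(num_to_unmask, dim=1).indices
        seq = seq.scatter(1, top_pos, pred.gather(1, top_pos))

    return seq[:, prompt_ids.shape[1]:]                  # the generated completion
```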
“While autoregressive LLMs can use reasoning to enhance quality, this improvement comes at a severe compute cost with frontier reasoning LLMs incurring 30+ seconds in latency to generate a single response,” Aditya Grover, assistant professor of computer science at UCLA and co-author of the d1 paper, told VentureBeat. “In contrast, one of the key benefits of dLLMs is their computational efficiency. For example, frontier dLLMs like Mercury can outperform the best speed-optimized autoregressive LLMs from frontier labs by 10x in user throughputs.”
Reinforcement learning for dLLMs
Despite their advantages, dLLMs still lag behind autoregressive models in reasoning abilities. Reinforcement learning has become crucial for teaching LLMs complex reasoning skills. By training models based on reward signals (essentially rewarding them for correct reasoning steps or final answers), RL has pushed LLMs toward better instruction-following and reasoning.
Algorithms such as Proximal Policy Optimization (PPO) and the more recent Group Relative Policy Optimization (GRPO) have been central to applying RL effectively to autoregressive models. These methods typically rely on calculating the probability (or log probability) of the generated text sequence under the model’s current policy to guide the learning process.
This calculation is straightforward for autoregressive models due to their sequential, token-by-token generation. However, for dLLMs, with their iterative, non-sequential generation process, directly computing this sequence probability is difficult and computationally expensive. This has been a major roadblock to applying established RL techniques to improve dLLM reasoning.
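To make the contrast concrete: the quantity these RL algorithms need is easy to compute for an autoregressive model, where the sequence log-probability factorizes into a sum over next-token predictions from a single forward pass. A masked dLLM has no comparable one-pass expression, since the probability of a sequence depends on the whole multi-step unmasking trajectory. An illustrative sketch of the autoregressive case (not the paper's code):

```python
import torch
import torch.nn.functional as F


def autoregressive_sequence_logprob(model, token_ids):
    """log p(sequence) under an AR model: one forward pass, sum of per-token terms.

    token_ids: LongTensor of shape (batch, seq_len).
    """
    logits = model(token_ids)                    # (batch, seq_len, vocab_size)
    log_probs = F.log_softmax(logits, dim=-1)

    # Position t is predicted from positions < t, so align each position's
    # distribution with the token that actually follows it.
    targets = token_ids[:, 1:]
    per_token = log_probs[:, :-1].gather(2, targets.unsqueeze(-1)).squeeze(-1)

    # For a masked dLLM there is no equivalent cheap, exact factorization:
    # the sequence probability depends on the whole unmasking trajectory.
    return per_token.sum(dim=-1)                 # (batch,)
```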
The d1 framework tackles this challenge with a two-stage post-training process designed specifically for masked dLLMs:
- Supervised fine-tuning (SFT): First, the pre-trained dLLM is fine-tuned on a dataset of high-quality reasoning examples. The paper uses the “s1k” dataset, which contains detailed step-by-step solutions to problems, including examples of self-correction and backtracking when errors occur. This stage aims to instill foundational reasoning patterns and behaviors into the model.
- Reinforcement learning with diffu-GRPO: After SFT, the model undergoes RL training using a novel algorithm called diffu-GRPO. This algorithm adapts the principles of GRPO to dLLMs. It introduces an efficient method for estimating log probabilities while avoiding the costly computations previously required. It also incorporates a clever technique called “random prompt masking.”
During RL training, parts of the input prompt are randomly masked in each update step. This acts as a form of regularization and data augmentation, allowing the model to learn more effectively from each batch of data.
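A hedged sketch of how random prompt masking could slot into a GRPO-style update; the helper names, the masking ratio, and the placeholder `logprob_fn` (standing in for diffu-GRPO's efficient log-probability estimator) are assumptions for illustration, not the paper's implementation:

```python
import torch


def randomly_mask_prompt(prompt_ids, mask_ratio=0.15, mask_id=0):
    """Randomly hide a fraction of prompt tokens for one RL update step.

    Each update sees a slightly different view of the same prompt, which acts
    as regularization and data augmentation during policy optimization.
    """
    mask = torch.rand_like(prompt_ids, dtype=torch.float) < mask_ratio
    return prompt_ids.masked_fill(mask, mask_id)


def grpo_style_update(policy, prompt_ids, sample_fn, reward_fn, logprob_fn,
                      optimizer, group_size=8):
    """One GRPO-style policy update with random prompt masking (sketch).

    sample_fn(policy, prompt)              -> a generated completion
    reward_fn(completion)                  -> scalar reward (e.g. answer correctness)
    logprob_fn(policy, prompt, completion) -> differentiable log-prob estimate
    """
    # Sample a group of completions for the prompt and score them.
    completions = [sample_fn(policy, prompt_ids) for _ in range(group_size)]
    rewards = torch.tensor([reward_fn(c) for c in completions], dtype=torch.float)

    # Group-relative advantages: normalize rewards within the group.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # Fresh random mask over the prompt for this update step.
    noisy_prompt = randomly_mask_prompt(prompt_ids)

    loss = torch.zeros(())
    for completion, adv in zip(completions, advantages):
        loss = loss - adv * logprob_fn(policy, noisy_prompt, completion)

    (loss / group_size).backward()
    optimizer.step()
    optimizer.zero_grad()
```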
d1 in real-world applications
The researchers applied the d1 framework to LLaDA-8B-Instruct, an open-source dLLM. They fine-tuned it using the s1k reasoning dataset for the SFT stage. They then compared several versions: the base LLaDA model, LLaDA with only SFT, LLaDA with only diffu-GRPO and the full d1-LLaDA (SFT followed by diffu-GRPO).
These models were tested on mathematical reasoning benchmarks (GSM8K, MATH500) and logical reasoning tasks (4×4 Sudoku, Countdown number game).
The results showed that the full d1-LLaDA consistently achieved the best performance across all tasks. Impressively, diffu-GRPO applied alone also significantly outperformed SFT alone and the base model.
“Reasoning-enhanced dLLMs like d1 can fuel many different kinds of agents for enterprise workloads,” Grover said. “These include coding agents for instantaneous software engineering, as well as ultra-fast deep research for real-time strategy and consulting… With d1 agents, everyday digital workflows can become automated and accelerated at the same time.”
Interestingly, the researchers observed qualitative improvements, especially when generating longer responses. The models began to exhibit “aha moments,” demonstrating self-correction and backtracking behaviors learned from the examples in the s1k dataset. This suggests the model isn’t just memorizing answers but learning more robust problem-solving strategies.
Autoregressive models have a first-mover advantage in terms of adoption. However, Grover believes that advances in dLLMs can shift the competitive landscape. For an enterprise, one way to decide between the two is whether its application is currently bottlenecked by latency or cost constraints.
According to Grover, reasoning-enhanced dLLMs such as d1 can help in one of two complementary ways:
- If an enterprise is currently unable to migrate to a reasoning model based on an autoregressive LLM, reasoning-enhanced dLLMs offer a plug-and-play alternative that allows enterprises to experience the superior quality of reasoning models at the speed of a non-reasoning autoregressive LLM.
- If the enterprise application allows for a larger latency and cost budget, d1 can generate longer reasoning traces using the same budget and further improve quality.
“In other words, d1-style dLLMs can Pareto-dominate autoregressive LLMs on the axis of quality, speed, and cost,” Grover said.