Researchers at the Tokyo-based startup Sakana AI have developed a new technique that enables language models to use memory more efficiently, helping enterprises cut the costs of building applications on top of large language models (LLMs) and other Transformer-based models.
The technique, named “Universal Transformer Memory,” uses special neural networks to optimize LLMs so that they keep the bits of information that matter and discard redundant details from their context.
Optimizing Transformer memory
The responses of Transformer models, the backbone of LLMs, depend on the content of their “context window,” that is, the input they receive from users.
The context window can be thought of as the model’s working memory. Tweaking its content can have a tremendous impact on the model’s performance, which has given rise to an entire field of “prompt engineering.”
Current models support very long context windows with hundreds of thousands, or even millions, of tokens (an LLM’s numerical representations of the words, word parts, phrases, concepts and numbers that users enter in their prompts).
This enables users to cram more information into their prompts. However, longer prompts result in higher compute costs and slower performance. Optimizing prompts to remove unnecessary tokens while keeping important information can reduce costs and increase speed.
Current prompt optimization techniques are resource-intensive or require users to manually test different configurations to reduce the size of their prompts.
Neural Attention Memory Models
Universal Transformer Memory optimizes prompts using Neural Attention Memory Models (NAMMs), simple neural networks that decide whether to “remember” or “forget” each given token stored in the LLM’s memory.
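Setting the paper’s details aside, the core idea can be sketched as a tiny scoring network that looks at each cached token and emits a keep-or-forget decision. The snippet below is a minimal, hypothetical illustration rather than Sakana’s implementation: the class name TinyNAMM, the feature dimension and the zero threshold are all assumptions, and the real models build their inputs from attention statistics rather than an arbitrary feature vector.

```python
import torch
import torch.nn as nn

class TinyNAMM(nn.Module):
    """Toy memory model: scores each cached token and emits a keep/forget mask."""

    def __init__(self, feature_dim: int, hidden_dim: int = 32):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, token_features: torch.Tensor) -> torch.Tensor:
        # token_features: (num_tokens, feature_dim), one row per token held in memory
        scores = self.scorer(token_features).squeeze(-1)  # (num_tokens,)
        return scores > 0.0  # True = remember the token, False = forget it
```

Tokens whose mask entry comes back False would simply be evicted from the model’s memory before the next decoding step.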
“This new capability allows transformers to discard unhelpful or redundant details, and focus on the most critical information, something we find to be crucial for tasks requiring long-context reasoning,” the researchers write.
NAMMs are trained separately from the LLM and are combined with the pre-trained model at inference time, which makes them flexible and easy to deploy. However, they need access to the inner activations of the model, which means they can only be applied to open-source models.
Like other techniques developed by Sakana AI, NAMMs are trained through evolutionary algorithms instead of gradient-based optimization methods. By iteratively mutating and selecting the best-performing candidates through trial and error, evolutionary algorithms optimize NAMMs for efficiency and performance. This is especially important because NAMMs must learn a non-differentiable objective: keeping or discarding tokens.
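As a rough illustration of that training loop, the sketch below runs a generic mutate-and-select search over a NAMM’s parameter vector. It is not the specific evolutionary algorithm used in the paper; the function evolve and its hyperparameters are hypothetical, and the only requirement is a black-box fitness_fn that evaluates a candidate NAMM on the downstream task and returns a score.

```python
import numpy as np

def evolve(fitness_fn, param_dim: int, pop_size: int = 16,
           generations: int = 100, sigma: float = 0.1, seed: int = 0):
    """Minimal mutate-and-select loop for a non-differentiable objective."""
    rng = np.random.default_rng(seed)
    best = rng.normal(size=param_dim)      # current best NAMM parameters
    best_score = fitness_fn(best)

    for _ in range(generations):
        # Mutate: sample candidate parameter vectors around the current best
        candidates = best + sigma * rng.normal(size=(pop_size, param_dim))
        scores = np.array([fitness_fn(c) for c in candidates])
        # Select: keep the top candidate if it beats the incumbent
        top = int(np.argmax(scores))
        if scores[top] > best_score:
            best, best_score = candidates[top], scores[top]
    return best, best_score
```

Because the loop only ever asks for a score, it never needs gradients through the keep-or-discard decisions, which is what makes evolutionary search a fit for this objective.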
NAMMs operate on the attention layers of LLMs, one of the key components of the Transformer architecture that determines the relations and importance of each token in the model’s context window. Based on attention values, NAMMs determine which tokens should be preserved and which can be discarded from the LLM’s context window. This attention-based mechanism makes it possible to use a trained NAMM on various models without further modification. For example, a NAMM trained on text-only data can be applied to vision or multi-modal models without additional training.
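To make the attention-based selection concrete, here is a simplified pruning step for a single layer’s key-value cache. In the actual method a trained NAMM makes this decision from attention statistics; the top-k heuristic and tensor shapes below are stand-in assumptions for illustration only.

```python
import torch

def prune_kv_cache(keys: torch.Tensor, values: torch.Tensor,
                   attn_weights: torch.Tensor, keep_ratio: float = 0.25):
    """Keep only the cached tokens that received the most attention.

    keys, values: (num_tokens, head_dim)
    attn_weights: (num_queries, num_tokens) attention over the cached tokens
    """
    importance = attn_weights.mean(dim=0)                 # average attention per cached token
    k = max(1, int(keep_ratio * keys.shape[0]))
    keep_idx = torch.topk(importance, k).indices.sort().values  # preserve original token order
    return keys[keep_idx], values[keep_idx]
```

Because the decision is driven only by attention values, which every Transformer layer produces regardless of modality, the same selection rule can be reused across text, vision and other models.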
Universal memory in action
To test the Universal Transformer Memory concept in action, the researchers trained a NAMM on top of an open-source Llama 3 8B model from Meta. Their experiments show that with NAMMs, Transformer-based models perform better on natural language and coding problems over very long sequences. Meanwhile, by discarding unnecessary tokens, the NAMM enabled the model to save up to 75% of its cache memory while performing the tasks.
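The headline saving follows from the fact that the key-value cache grows linearly with the number of cached tokens. The back-of-the-envelope calculation below assumes the commonly reported Llama 3 8B configuration (32 layers, 8 key-value heads under grouped-query attention, head dimension 128) and fp16 caches; the 75% figure is the researchers’ measurement, and this sketch only shows why discarding tokens translates directly into memory savings.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int = 32, kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Approximate key + value cache size for a decoder-only Transformer."""
    return 2 * num_layers * kv_heads * head_dim * dtype_bytes * num_tokens

full = kv_cache_bytes(100_000)    # all 100k tokens kept: ~13.1 GB
pruned = kv_cache_bytes(25_000)   # only 25% of tokens kept: ~3.3 GB
print(f"cache saved: {1 - pruned / full:.0%}")  # -> cache saved: 75%
```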
“Across our benchmarks, NAMMs provide clear performance improvements to the Llama 3 8b transformer,” the researchers write. “Furthermore, our memory systems yield notable side benefits, reducing the context size of each layer, while never being explicitly optimized for memory efficiency.”
They also tested the NAMM on the 70B version of Llama as well as Transformer models designed for other modalities and tasks, such as LLaVA (computer vision) and Decision Transformer (reinforcement learning).
“Even in these out-of-distribution settings, NAMMs retain their benefits by discarding tokens such as redundant video frames and suboptimal actions, allowing their new base models to focus on the most relevant information to improve performance,” the researchers write.
Task-dependent behavior
Another interesting finding is that NAMMs automatically adjust their behavior based on the task.
For example, for coding tasks, the model discards contiguous chunks of tokens that correspond to comments and whitespaces that don’t affect the code’s execution.
On the other hand, in natural language tasks, the model discards tokens that represent grammatical redundancies and don’t affect the meaning of the sequence.
The researchers released the code for creating your own NAMMs.
Techniques such as Universal Transformer Memory can be very useful for enterprise applications that process millions of tokens and can benefit from speed boosts and cost reduction. The reusability of a trained NAMM also makes it a versatile tool to use across different applications in an enterprise.
For the future, the researchers suggest more advanced techniques, such as using NAMMs during the training of LLMs to further extend their memory capabilities.
“This work has only begun to tap into the potential of our new class of memory models, which we anticipate might offer many new opportunities to advance future generations of transformers,” the researchers write.