Apple researchers have developed a new artificial intelligence system that can understand ambiguous references to on-screen entities as well as conversational and background context, enabling more natural interactions with voice assistants, according to a paper published on Friday.
The system, called ReALM (Reference Resolution As Language Modeling), leverages large language models to convert the complex task of reference resolution — including understanding references to visual elements on a screen — into a pure language modeling problem. This allows ReALM to achieve substantial performance gains compared to existing methods.
“Being able to understand context, including references, is essential for a conversational assistant,” wrote the team of Apple researchers. “Enabling the user to issue queries about what they see on their screen is a crucial step in ensuring a true hands-free experience in voice assistants.”
Enhancing conversational assistants
To tackle screen-based references, a key innovation of ReALM is reconstructing the screen using parsed on-screen entities and their locations to generate a textual representation that captures the visual layout. The researchers demonstrated that this approach, combined with fine-tuning language models specifically for reference resolution, could outperform GPT-4 on the task.
“We demonstrate large improvements over an existing system with similar functionality across different types of references, with our smallest model obtaining absolute gains of over 5% for on-screen references,” the researchers wrote. “Our larger models substantially outperform GPT-4.”
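To make the screen-reconstruction idea concrete, here is a minimal sketch of how parsed entities and their positions might be flattened into text that a language model can read. It is an illustration under stated assumptions, not Apple's implementation: the `ScreenEntity` structure, the `line_margin` threshold, the tab and newline layout, and the prompt wording in `build_prompt` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScreenEntity:
    """A parsed on-screen element (hypothetical structure for illustration)."""
    text: str
    x: float  # horizontal center of the element's bounding box
    y: float  # vertical center of the element's bounding box


def render_screen(entities: list[ScreenEntity], line_margin: float = 10.0) -> str:
    """Lay entities out as text: rows top to bottom, items left to right.

    line_margin is an assumed threshold: an entity whose vertical center is
    within this distance of a row's first entity joins that row.
    """
    rows: list[list[ScreenEntity]] = []
    for entity in sorted(entities, key=lambda e: (e.y, e.x)):
        if rows and abs(entity.y - rows[-1][0].y) <= line_margin:
            rows[-1].append(entity)
        else:
            rows.append([entity])

    lines = []
    for row in rows:
        row.sort(key=lambda e: e.x)  # left to right within a row
        lines.append("\t".join(e.text for e in row))
    return "\n".join(lines)


def build_prompt(entities: list[ScreenEntity], query: str) -> str:
    """Combine the textual screen and the user request (hypothetical wording)."""
    return (
        f"Screen:\n{render_screen(entities)}\n\n"
        f"User request: {query}\n"
        "Which on-screen entity does the request refer to?"
    )


if __name__ == "__main__":
    screen = [
        ScreenEntity("Call", 200, 102),
        ScreenEntity("Joe's Pizza", 40, 100),
        ScreenEntity("555-0100", 40, 140),
        ScreenEntity("Directions", 200, 138),
    ]
    print(build_prompt(screen, "Call the number shown below the name"))
```

Because the layout is preserved as rows and columns of plain text, a request such as "the number below the name" can in principle be resolved by a text-only model without any image input.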
Practical applications and limitations
The work highlights the potential for focused language models to handle tasks like reference resolution in production systems where using massive end-to-end models is infeasible due to latency or compute constraints. By publishing the research, Apple is signaling its continued investment in making Siri and other products more conversant and context-aware.
Still, the researchers caution that relying on automated parsing of screens has limitations. Handling more complex visual references, like distinguishing between multiple images, would likely require incorporating computer vision and multi-modal techniques.
Apple races to close AI gap as rivals soar
Apple is quietly making significant strides in artificial intelligence research, even as it trails tech rivals in the race to dominate the fast-moving AI landscape.
From multimodal models that blend vision and language, to AI-powered animation tools, to techniques for building high-performing specialized AI on a budget, a steady drumbeat of breakthroughs from the company’s research labs suggests its AI ambitions are rapidly escalating.
But the famously secretive tech giant faces stiff competition from the likes of Google, Microsoft, Amazon and OpenAI, who have aggressively productized generative AI in search, office software, cloud services and more.
Apple, long a fast follower rather than a first mover, now confronts a market being transformed at breakneck speed by artificial intelligence. At its closely watched Worldwide Developers Conference in June, the company is expected to unveil a new large language model framework, an “Apple GPT” chatbot, and other AI-powered features across its ecosystem.
“We’re excited to share details of our ongoing work in AI later this year,” CEO Tim Cook recently hinted on an earnings call. Despite the company’s characteristic opacity, it is clear that Apple’s AI efforts are sweeping in scope.
Yet as the battle for AI supremacy heats up, the iPhone maker’s lateness to the party has put it in an uncharacteristic position of weakness. Deep coffers, brand loyalty, elite engineering and a tightly integrated product portfolio give it a puncher’s chance, but there are no guarantees in this high-stakes contest.
A new age of ubiquitous, truly intelligent computing is on the horizon. Come June, we’ll see if Apple has done enough to ensure it has a hand in shaping it.
The post Apple researchers develop AI that can ‘see’ and understand screen context appeared first on VentureBeat.