How Cerebras is breaking the GPU bottleneck on AI inference

Nvidia has long dominated the market in compute hardware for AI with its graphics processing units (GPUs). However, the Spring 2024 launch of Cerebras Systems’ mature third-generation chip, based on its flagship wafer-scale engine technology, is shaking up the landscape by offering enterprises an innovative and competitive alternative.

This article explores why Cerebras’ new product matters, how it stacks up against both Nvidia’s offerings and those of Groq, another new startup providing advanced AI-specialized compute hardware, and what enterprise decision-makers should consider when navigating this evolving landscape.

First, a note on why the timing of Cerebras’ and Groq’s challenge is so important. Until now, most of the processing for AI has been in the training of large language models (LLMs), not in actually applying those models for real purposes. Nvidia’s GPUs have been extremely dominant during that period. But in the next 18 months, industry experts expect the market to reach an inflection point as the AI projects that many companies have been training and developing will finally be deployed. At that point, AI workloads shift from training to what the industry calls inference, where speed and efficiency become much more important. Will Nvidia’s line of GPUs be able to maintain top position?

Let’s take a deeper look. Inference is the process by which a trained AI model evaluates new data and produces results, for example during a chat with an LLM or as a self-driving car maneuvers through traffic. Training, by contrast, is when the model is being shaped behind the scenes before being released. Inference is critical to all AI applications, from split-second real-time interactions to the data analytics that drive long-term decision-making. The AI inference market is on the cusp of explosive growth, with estimates predicting it will reach $90.6 billion by 2030.

Historically, AI inference has been performed on GPUs, largely because of their general superiority over CPUs at the parallel computing needed for efficient training over massive datasets. However, as demand for heavy inference workloads increases, the drawbacks of GPUs become harder to ignore: they consume significant power, generate high levels of heat and are expensive to maintain.

Cerebras, founded in 2016 by a team of AI and chip design experts, is a pioneer in the field of AI inference hardware. The company’s flagship product, the Wafer-Scale Engine (WSE), is a revolutionary AI processor that sets a new bar for inference performance and efficiency. The recently launched third-generation chip, the WSE-3 that powers the CS-3 system, boasts 4 trillion transistors, making it the physically largest neural network chip ever produced. At 56x larger than the biggest GPUs, it is closer in size to a dinner plate than a postage stamp. It also contains 3,000x more on-chip memory than those GPUs. This means that individual chips can handle huge workloads without having to be networked together, an architectural innovation that enables faster processing speeds, greater scalability and reduced power consumption.

The CS-3 excels with LLMs; reports indicate that Cerebras’ chip can process an eye-watering 1,800 tokens per second for the Llama 3.1 8B model, far outpacing current GPU-based solutions. Moreover, with pricing starting at just 10 cents per million tokens, Cerebras is positioning itself as a competitive solution.
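To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It takes the two numbers cited above (1,800 tokens per second and 10 cents per million tokens) at face value. The 500-token reply and the one-billion-token monthly volume are illustrative assumptions, not figures from Cerebras.

```python
# Figures cited in the article, taken at face value.
TOKENS_PER_SECOND = 1_800          # reported Llama 3.1 8B throughput
PRICE_PER_MILLION_TOKENS = 0.10    # US dollars, reported starting price


def generation_seconds(num_tokens: int) -> float:
    """Seconds needed to emit num_tokens at the cited throughput."""
    return num_tokens / TOKENS_PER_SECOND


def token_cost_usd(num_tokens: int) -> float:
    """Cost in US dollars of num_tokens at the cited starting price."""
    return num_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS


# Hypothetical workload sizes, chosen only for illustration.
response_tokens = 500
monthly_tokens = 1_000_000_000

print(f"{response_tokens}-token reply: {generation_seconds(response_tokens):.2f} s")
print(f"{monthly_tokens:,} tokens: ${token_cost_usd(monthly_tokens):,.2f}")
```

Under these assumptions, a 500-token reply takes roughly 0.28 seconds to generate, and a billion tokens costs about $100. Actual latency and bills would depend on prompt length, model choice and pricing tier.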

The need for speed

Given the demand for AI inference, it is no surprise that Cerebras’ impressive stats are drawing industry attention. Indeed, the company has had enough early traction that its press kit cites several industry leaders lauding its technology.

“Speed and scale change everything,” according to Kim Branson, SVP of AI/ML at GlaxoSmithKline, where the boost provided by Cerebras’ CS-3 has reportedly improved the company’s ability to handle massive datasets for drug discovery and analysis.

Denis Yarats, CTO of Perplexity, sees ultra-fast inference as the key to reshaping search engines and user experiences. “Lower latencies drive higher user engagement,” said Yarats. “With Cerebras’ 20x speed advantage over traditional GPUs, we believe user interaction with search and intelligent answer engines will be fundamentally transformed.”

Russell d’Sa, CEO of LiveKit, highlighted how Cerebras’ ultra-fast inference has enabled his company to develop next-gen multimodal AI applications with voice and video-based interactions. “Combining Cerebras’ best-in-class compute with LiveKit’s global edge network has allowed us to create AI experiences that feel more human, thanks to the system’s ultra-low latency.”

The competitive landscape: Nvidia vs. Groq vs. Cerebras

Despite the power of its technology, Cerebras faces a competitive market. Nvidia’s dominance in the AI hardware market is well established, with its Hopper GPUs being a staple in training and running AI models. Compute on Nvidia’s GPUs is available through cloud providers such as Amazon Web Services, Google Cloud Platform and Microsoft Azure, and Nvidia’s established market presence gives it a significant edge in terms of ecosystem support and customer trust.

However, the AI hardware market is evolving, and competition is intensifying. Groq, another AI chip startup, has also been making waves with its own inference-focused language processing unit (LPU). Based on proprietary Tensor Streaming Processor (TSP) technology, Groq also boasts impressive performance benchmarks, energy efficiency and competitive pricing.

Despite the impressive performance of Cerebras and Groq, many enterprise decision-makers may not have heard much about them yet, primarily because they are new entrants to the field and are still expanding their distribution channels, whereas Nvidia GPUs are available from all major cloud providers. However, both Cerebras and Groq now offer robust cloud computing solutions and sell their hardware. Cerebras Cloud provides flexible pricing models, including per-model and per-token options, allowing users to scale their workloads without heavy upfront investments. Similarly, Groq Cloud offers users access to its cutting-edge inference hardware via the cloud, boasting that users can “switch from other providers like OpenAI by switching three lines of code”. Both companies’ cloud offerings allow decision-makers to experiment with advanced AI inference technologies at a lower cost and with greater flexibility, making it relatively easy to get started despite their smaller market presence compared to Nvidia.
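The “three lines of code” claim is easier to picture with a sketch. When a provider accepts the same request format as the one an application already uses, the client is typically defined by three settings: the endpoint URL, the API key and the model name. The Python below is a hypothetical illustration of that idea, not Groq’s or OpenAI’s actual client code. `ProviderConfig`, the example.com URLs, the key placeholders and the model names are all made-up stand-ins.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ProviderConfig:
    """Hypothetical bundle of the settings a chat client needs."""
    base_url: str
    api_key: str
    model: str
    temperature: float = 0.2   # application setting, unaffected by the switch


# Placeholder values only; real URLs, keys and model names come from
# each provider's documentation.
current = ProviderConfig(
    base_url="https://api.provider-a.example.com/v1",
    api_key="PROVIDER_A_KEY",
    model="provider-a-model",
)

# The "three lines" that change when pointing the same application
# at a different, API-compatible provider.
candidate = replace(
    current,
    base_url="https://api.provider-b.example.com/v1",
    api_key="PROVIDER_B_KEY",
    model="provider-b-model",
)


def changed_settings(old: ProviderConfig, new: ProviderConfig) -> list[str]:
    """Names of the settings that differ between two configurations."""
    return [f.name for f in fields(old) if getattr(old, f.name) != getattr(new, f.name)]


print(changed_settings(current, candidate))
```

Keeping provider settings in one place like this is a design choice that makes side-by-side trials cheap: the rest of the application does not need to know which vendor is serving the model.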

How do the options stack up?

Nvidia

Performance: GPUs like the H100 excel in parallel processing tasks, but cannot match the speed of the specialized CS-3 and LPU for AI inference.
Energy Efficiency: While Nvidia has made strides in improving the energy efficiency of its GPUs, they remain power-hungry compared to Cerebras and Groq’s offerings.
Scalability: GPUs are highly scalable, with well-established methods for connecting multiple GPUs to work on large AI models.
Flexibility: Nvidia offers extensive customization through its CUDA programming model and broad software ecosystem. This flexibility allows developers to tailor their GPU setups to a wide range of computational tasks beyond AI inference and training.
Cloud Compute Access: Nvidia GPU compute as a service is available at hyperscale through many cloud providers, such as GCP, AWS and Azure.

Cerebras

Power: CS-3 is a record-breaking powerhouse with 900,000 AI-optimized cores and 4 trillion transistors, capable of handling AI models with up to 24 trillion parameters. It offers peak AI performance of 125 petaflops, making it exceptionally efficient for large-scale AI models.
Energy Efficiency: The CS-3’s massive single-chip design reduces the need for traffic between components, which significantly lowers energy usage compared to massively networked GPU alternatives.
Scalability: The CS-3 is highly scalable, supporting clusters of up to 2,048 systems that together deliver up to 256 exaflops of AI compute.
Strategic Partnerships: Cerebras is integrating with major AI tools like LangChain, Docker and Weights and Biases, providing a robust ecosystem that supports rapid AI application development.
Cloud Compute Access: Currently only available through Cerebras Cloud, which offers flexible per-model or per-token pricing.
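
The cluster figure follows from the per-system number. As a quick sanity check in Python, assuming peak performance simply adds up across systems (real workloads rarely scale perfectly), 2,048 systems at 125 petaflops each come to 256 exaflops:

```python
# Figures cited in the article; the linear-scaling model is an assumption.
PEAK_PETAFLOPS_PER_SYSTEM = 125
MAX_SYSTEMS_PER_CLUSTER = 2048
PETAFLOPS_PER_EXAFLOP = 1_000


def cluster_peak_exaflops(num_systems: int) -> float:
    """Aggregate peak AI compute, assuming it sums linearly across systems."""
    if not 0 < num_systems <= MAX_SYSTEMS_PER_CLUSTER:
        raise ValueError("cluster size outside the cited range")
    return num_systems * PEAK_PETAFLOPS_PER_SYSTEM / PETAFLOPS_PER_EXAFLOP


print(cluster_peak_exaflops(MAX_SYSTEMS_PER_CLUSTER))  # 256.0
```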

Groq

What enterprise decision-makers should do next

Given the rapidly evolving landscape of AI hardware, enterprise decision-makers should take a proactive approach to evaluating their options. While Nvidia remains the market leader, the emergence of Cerebras and Groq offers compelling alternatives to watch. Long the gold standard of AI compute, the Nvidia GPU now looks like a general-purpose tool pressed into service for the job, rather than a specialized tool optimized for its purpose. Purpose-designed AI chips such as the Cerebras CS-3 and Groq LPU may represent the future.

Here are some steps that business leaders can take to navigate this changing landscape:

Assess Your AI Workloads: Determine whether your current and planned AI workloads could benefit from the performance advantages offered by Cerebras or Groq. If your organization relies heavily on LLMs or real-time AI inference, these new technologies could provide significant benefits.
Assess Cloud and Hardware Offerings: Once your workloads are clearly defined, evaluate the cloud and hardware solutions provided by each vendor. Consider whether using cloud-based compute services, investing in on-premises hardware, or taking a hybrid approach will most suit your needs.
Evaluate Vendor Ecosystems: Nvidia GPU compute is widely available from cloud providers, and its hardware and software developer ecosystems are robust, whereas Cerebras and Groq are new players in the field.
Stay Agile and Informed: Maintain agility in your decision-making process, and ensure your team stays informed about the latest advancements in AI hardware and cloud services.

The entry of startup chip-makers Cerebras and Groq into the field of AI inference changes the game significantly. Their specialized chips like the CS-3 and LPU outperform the Nvidia GPU processors that have been the industry standard. As the AI inference technology market continues to evolve, enterprise decision-makers should continually evaluate their needs and strategies.

The post How Cerebras is breaking the GPU bottleneck on AI inference appeared first on VentureBeat.