Amazon Announces Inference Chips Deal With Cerebras
TL;DR
Amazon Web Services announced a landmark partnership with Cerebras Systems on March 13, 2026, combining AWS's Trainium chips with Cerebras's wafer-scale processors to create a "disaggregated inference" architecture that claims speeds 21 times faster than Nvidia's Blackwell systems. The deal — coming just two months after OpenAI signed a $10 billion agreement with Cerebras — signals a tectonic shift in the AI chip market as hyperscalers move to break Nvidia's dominance in the rapidly growing inference segment.
On March 13, 2026, Amazon Web Services quietly dropped a bombshell on the semiconductor industry. The world's largest cloud provider announced it would integrate Cerebras Systems' wafer-scale chips into its data centers — making AWS the first hyperscaler to commit to the unconventional chipmaker's technology. The deal, built around a novel "disaggregated inference" architecture, represents Amazon's most aggressive move yet to loosen Nvidia's stranglehold on the AI chip market.
But the partnership is more than a procurement announcement. It is a strategic declaration that the era of the monolithic GPU — where a single chip handles every phase of AI computation — may be ending. And the stakes could hardly be higher: the AI inference market is projected to account for two-thirds of all AI compute by 2026, up from one-third in 2023.
The Architecture: Splitting the Atom of AI Inference
At the heart of the Amazon-Cerebras partnership lies a deceptively simple idea: what if the two stages of running a large language model were handled by entirely different chips, each optimized for its specific task?
When a user sends a prompt to an AI model, two things happen. First, the system processes the entire prompt in a phase called "prefill" — a highly parallel, compute-intensive operation. Then comes "decode," where the model generates its response one token at a time — a sequential, memory-bandwidth-hungry process.
Traditional GPU-based systems use the same hardware for both phases, which means neither is fully optimized. Amazon and Cerebras are betting that splitting these stages across purpose-built silicon will unlock dramatic performance gains.
Under the new architecture, Amazon's Trainium chips handle the prefill stage, processing user prompts and computing the attention mechanism's key-value cache. That cache is then transmitted via Amazon's Elastic Fabric Adapter (EFA) interconnect to Cerebras's CS-3 system, whose Wafer-Scale Engine takes over the decode phase.
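To make the split concrete, here is a minimal sketch in Python of the two phases and the handoff between them. It uses a toy single-head attention layer built on NumPy; the function names, dimensions, and the in-process "transfer" are illustrative stand-ins, not the actual Trainium or Cerebras software stack.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64                      # head dimension for this toy example
Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

def prefill(prompt_emb):
    """Prefill device (Trainium's role): process the whole prompt at once.

    Every position is computed in parallel, so this phase is compute-bound
    and maps well onto dense-matmul hardware. Returns the key-value cache
    that the decode device needs.
    """
    k_cache = prompt_emb @ Wk          # (T, D): all positions in one matmul
    v_cache = prompt_emb @ Wv
    return k_cache, v_cache

def decode_step(x, k_cache, v_cache):
    """Decode device (Cerebras's role): one token per step.

    Each step re-reads the entire KV cache, so throughput is limited by
    memory bandwidth rather than FLOPs, the case for wafer-scale SRAM.
    """
    q = x @ Wq
    k_cache = np.vstack([k_cache, x @ Wk])   # append this step's K/V
    v_cache = np.vstack([v_cache, x @ Wv])
    scores = (q @ k_cache.T) / np.sqrt(D)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over all cached positions
    return weights @ v_cache, k_cache, v_cache

# Disaggregated flow: prefill here, then hand the cache off.
prompt = rng.standard_normal((16, D))        # a 16-token prompt embedding
k_cache, v_cache = prefill(prompt)           # phase 1 (parallel)
# In the real system, this cache transfer rides AWS's EFA interconnect.
x = prompt[-1]                               # seed decode with the last position
for _ in range(8):                           # phase 2 (sequential)
    x, k_cache, v_cache = decode_step(x, k_cache, v_cache)
```

The point of the sketch is structural: prefill is one large parallel matrix multiplication, while every decode step must stream the full cache back through the chip, which is why the two phases favor different silicon.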
The performance claims are striking. AWS says the disaggregated system delivers inference speeds "more than 20 times faster than the competition" — a pointed reference to Nvidia's Blackwell platform — and provides five times more high-speed token capacity in the same hardware footprint.
Inside the Wafer-Scale Engine
Cerebras's core technology is as radical as its name suggests. While conventional chips are cut from silicon wafers into small individual processors, Cerebras uses nearly an entire 300-millimeter wafer as a single chip. The resulting WSE-3 measures 46,225 square millimeters — roughly the size of a dinner plate — and packs 4 trillion transistors across 900,000 AI-optimized cores.
The design philosophy prioritizes memory bandwidth above all else. The CS-3 system provides 44 gigabytes of on-chip SRAM with 27 petabytes per second of internal memory bandwidth — more than 200 times the bandwidth available through Nvidia's NVLink interconnect. For the decode phase of inference, where the bottleneck is memory access rather than raw compute, this architecture is purpose-built to dominate.
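A back-of-the-envelope calculation shows why bandwidth, not raw compute, sets the ceiling here. The sketch below assumes a dense 70-billion-parameter model at 16-bit precision with batch size 1 and ignores KV-cache traffic; the GPU bandwidth figure is a rough HBM-class estimate of ours, while the 27 PB/s figure is the one AWS cites.

```python
# Back-of-envelope: why decode speed tracks memory bandwidth.
# Assumptions (ours, not from the announcement): dense 70B-parameter
# model, 16-bit weights, batch size 1, KV-cache reads ignored.

PARAMS = 70e9                                # model parameters
BYTES_PER_PARAM = 2                          # FP16/BF16
bytes_per_token = PARAMS * BYTES_PER_PARAM   # weights streamed once per token

def peak_tokens_per_sec(bandwidth_bytes_per_sec):
    # Upper bound: every byte of weights must be read once per generated token.
    return bandwidth_bytes_per_sec / bytes_per_token

hbm_gpu = 8e12     # ~8 TB/s, roughly a current HBM-class GPU
wse_sram = 27e15   # 27 PB/s, the on-wafer figure AWS cites

print(f"GPU-class HBM : ~{peak_tokens_per_sec(hbm_gpu):,.0f} tokens/s ceiling")
print(f"Wafer-scale   : ~{peak_tokens_per_sec(wse_sram):,.0f} tokens/s ceiling")
```

The absolute numbers are theoretical ceilings, not benchmarks — and since a 140 GB model far exceeds the 44 GB of on-chip SRAM, weights would in practice be sharded across multiple CS-3 systems. The takeaway is only that single-stream decode throughput scales with memory bandwidth.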
"Speed remains a critical bottleneck for demanding workloads like real-time coding assistance and interactive applications," said David Brown, AWS Vice President of Compute and ML Services, in the announcement . The subtext is clear: as AI moves from experimental to production workloads, inference speed is becoming the competitive differentiator.
A Company Transformed
For Cerebras, the AWS deal is the latest — and perhaps most consequential — in a series of partnerships that have transformed the company from an exotic chipmaker with a customer concentration problem into a serious contender in the AI infrastructure market.
The company's trajectory has been dramatic. Just 18 months ago, Cerebras was preparing for an IPO that was ultimately withdrawn in October 2025 after CEO Andrew Feldman acknowledged the company's financial filings had become "stale." At the time, regulators were scrutinizing the company's heavy dependence on a single customer: G42, the Abu Dhabi-based AI firm that accounted for 87% of Cerebras's revenue in the first half of 2024.
Since then, Cerebras has executed a remarkable pivot. In January 2026, OpenAI signed a deal worth over $10 billion for up to 750 megawatts of Cerebras-powered compute capacity, with delivery stretching through 2028. Days later, Oracle name-dropped Cerebras alongside Nvidia and AMD during its earnings call. And in February 2026, Cerebras closed a $1 billion Series H round led by Tiger Global at a $23 billion valuation — nearly tripling its worth in just four months.
The AWS partnership further diversifies Cerebras's customer base at a critical moment. G42 is no longer listed as an investor in the company as of December 2025, though it remains a customer through the Condor Galaxy supercomputer network. For a company targeting a second-quarter 2026 IPO, having Amazon, OpenAI, and Oracle as customers is a fundamentally different pitch to public market investors than relying on a single Middle Eastern tech conglomerate.
The Nvidia Question
The Amazon-Cerebras announcement lands just days before Nvidia's GTC 2026 keynote on March 16, where CEO Jensen Huang is expected to unveil the "Vera Rubin" platform and potentially Nvidia's own response to the disaggregated inference trend.
Nvidia's position remains formidable. The company commands an estimated 80-90% of the high-end AI chip market, its data center revenue hit $30.8 billion in a single quarter in 2025, and its CUDA software ecosystem creates a switching cost that few competitors have overcome. But the structural threat is real: the inference market is growing faster than training, and custom ASIC shipments from cloud providers are projected to grow 44.6% in 2026, compared to 16.1% growth for GPU shipments.
The disaggregated approach poses a particular challenge for Nvidia's business model. Nvidia's GPUs are designed as general-purpose accelerators — powerful enough to handle any phase of AI computation, but not specifically optimized for any single one. If the industry moves toward specialized hardware combinations for inference, Nvidia's "one chip to rule them all" approach could become a liability rather than an asset.
Nvidia is not standing still. In late December 2025, the company spent $17 billion to acquire Groq, an inference chip startup, signaling its own interest in specialized decode hardware. Analysts expect Huang to detail how Nvidia plans to combine its GPUs with Groq's technology at GTC — but Amazon has been quick to note that the Nvidia-Groq timeline remains unclear, while its own Trainium-Cerebras system is just months from production.
Amazon's Broader AI Chip Strategy
The Cerebras deal fits into Amazon's increasingly sophisticated approach to AI silicon. Through its subsidiary Annapurna Labs — acquired in 2015 for $350 million — AWS has built a custom chip portfolio that includes Inferentia for inference workloads and Trainium for training.
The latest generation, Trainium2, delivers competitive performance at roughly half the cost of comparable Nvidia H100 instances. AWS is currently deploying a massive 400,000-chip Trainium2 cluster for Anthropic under "Project Rainier," proving that its custom silicon can handle frontier AI workloads.
With the Cerebras partnership, Amazon is now pursuing a hybrid strategy: its own Trainium chips for the compute-heavy prefill phase, paired with Cerebras's specialized hardware for memory-bandwidth-intensive decode. The approach lets AWS offer differentiated performance without having to develop wafer-scale technology in-house — while also giving Cerebras access to the world's largest cloud platform.
AWS customers can expect a phased rollout of the Trainium-Cerebras clusters beginning in the third quarter of 2026, starting in the US-East (N. Virginia) and US-West (Oregon) regions. The service will be available through Amazon Bedrock, offering open-source LLMs and Amazon's own Nova models running on the disaggregated infrastructure.
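If the Cerebras-backed capacity surfaces through Bedrock's standard interfaces, invoking it should look like any other Bedrock call. The boto3 sketch below uses the existing Converse API and a current Nova model identifier; whatever identifier ultimately designates the disaggregated deployment is our assumption, since none had been published at announcement time.

```python
# Minimal sketch of calling a Bedrock-hosted model with boto3.
# The client and converse() API are standard Bedrock; the model ID is
# a current Nova identifier standing in for the (hypothetical)
# Cerebras-accelerated variant, which had not been listed yet.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="amazon.nova-pro-v1:0",  # placeholder: swap in the
                                     # disaggregated-inference variant once listed
    messages=[{
        "role": "user",
        "content": [{"text": "Summarize the trade-offs of disaggregated inference."}],
    }],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])
```

Because Bedrock abstracts the underlying hardware, a speedup from the Trainium-Cerebras path would show up as lower latency on the same call, with no application changes required.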
The Inference Economy
The timing of this deal reflects a broader industry reckoning. The AI sector is shifting from a training-dominated era — where companies competed to build the largest models — to an inference-dominated one, where the challenge is running those models at scale, at speed, and at reasonable cost.
Deloitte projects that inference will account for two-thirds of all AI compute spending by 2026, up from approximately one-third in 2023. This shift has profound implications for chip architecture. Training workloads reward raw compute power and can tolerate higher latency; inference workloads demand low latency, high throughput, and — critically — cost efficiency at enormous scale.
"Every enterprise around the world will be able to benefit from blisteringly fast inference within their existing AWS environment," said Cerebras CEO Andrew Feldman. The phrase "existing AWS environment" is key: by integrating with Bedrock, Cerebras is not asking customers to adopt an entirely new platform, but rather offering a speed upgrade within an ecosystem they already use.
What This Means for the Market
The Amazon-Cerebras deal represents a significant development in the AI chip market, but its ultimate impact will depend on execution. Several questions remain unanswered.
First, financial terms have not been disclosed, making it difficult to assess the deal's scale relative to Cerebras's $10 billion OpenAI agreement or its overall revenue trajectory — which stood at $136 million for just the first half of 2024.
Second, Cerebras's wafer-scale technology has always faced manufacturing questions. Each WSE-3 chip requires nearly an entire silicon wafer, and yield rates for such large chips are inherently challenging. TSMC fabricates the chips, but scaling production to meet simultaneous demand from OpenAI, AWS, and other customers will test that capacity.
Third, the performance claims — particularly the headline figure of 21 times faster than Blackwell — have not been independently verified. Real-world performance across diverse model architectures and workload patterns may differ from optimized demonstrations.
What is clear is that the AI chip market is entering a new phase. The monolithic GPU approach that carried the industry through the training era is being supplemented — and in some cases challenged — by specialized, disaggregated architectures designed for the inference economy. Amazon's decision to bet on Cerebras, the first such commitment by any hyperscaler, suggests the company sees this not as an experiment but as the future of AI infrastructure.
For Nvidia, the message is unmistakable: the inference market will not be won with the same playbook that dominated training. For Cerebras, the pressure is equally intense — the company must now deliver on its promises at hyperscale, with the world's most demanding cloud customers watching. And for the broader AI industry, the deal underscores a fundamental truth: in a world where inference is the product, the chips that run it will determine who wins.