Non-GPU AI Accelerators: The Post-NVIDIA Landscape

5 min read · ai · local-ai

TL;DR: NVIDIA’s CUDA moat is real, but the AI chip market is diversifying fast. AWS Trainium3 (TSMC 3nm) is GA and Trainium4 has been announced. Google’s TPU v6 Trillium is shipping. Cerebras delivers 2,500 tok/s on Llama 4 Maverick. SambaNova just unveiled the SN50, claiming 5x over Blackwell. Intel cancelled Falcon Shores and pivoted to Jaguar Shores. Korea’s FuriosaAI and Rebellions are in mass production. And TSMC fabs nearly everything, Intel’s Gaudi included.

NVIDIA owns ~80% of the AI accelerator market, and CUDA’s 15-year ecosystem lock-in makes switching painful. But every major hyperscaler is designing custom silicon, and a new generation of startups is targeting inference-specific chips. This is a map of who’s shipping, who’s close, and who’s still in stealth — updated with live research as of May 2026.

Tier 1 — Shipping at Scale

AWS Trainium2, Trainium3, and Trainium4

Annapurna Labs (acquired by AWS) designs two chip families: Trainium for training, Inferentia for inference. Both run on the Neuron SDK with PyTorch, TensorFlow, and JAX support.
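For concreteness, here is roughly what the inference path through Neuron looks like. This is a minimal sketch, assuming a Trn/Inf instance with torch-neuronx installed; the model and shapes are placeholders, not a recommended architecture.

```python
# Minimal Neuron sketch: compile a PyTorch model for Inferentia/Trainium.
# Assumes an AWS Neuron instance with torch-neuronx installed.
import torch
import torch_neuronx

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.GELU(),
    torch.nn.Linear(512, 64),
).eval()

example = torch.rand(1, 512)

# Ahead-of-time compile for NeuronCores; the returned module keeps the
# normal PyTorch calling convention but executes on the accelerator.
neuron_model = torch_neuronx.trace(model, example)
print(neuron_model(example).shape)

torch.jit.save(neuron_model, "model_neuron.pt")  # reload with torch.jit.load
```

Training flows use the same package but go through PyTorch/XLA devices rather than ahead-of-time tracing.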

Three generations are now in play:

  • Trainium2 (TSMC 5nm) — the workhorse. The Trn2 UltraServer packs 64 chips sharing 6 TB of HBM3 with 83.2 PFLOPS of dense FP8 compute. Anthropic trains Claude on these.
  • Trainium3 (TSMC 3nm) — GA December 2025 at re:Invent. Delivers 2.52 PFLOPS FP8 per chip, 144GB HBM3e, 1.5x memory capacity and 1.7x bandwidth over Trn2. Uses NeuronLink-v4 interconnect with the 160-lane Scorpio-X switch. Anthropic is already building multi-gigawatt clusters on Trn3.
  • Trainium4 — announced at re:Invent 2025 alongside Trn3 GA. Details are still emerging but it’s positioned as the next leap beyond Trn3.

Inferentia2 (TSMC 7nm) handles inference, priced from $0.76/hr for the smallest instance up to $12.98/hr for the 12-chip inf2.48xlarge.

Google TPU v5p and v6 Trillium

Google’s Tensor Processing Units are matrix-multiply ASICs fabbed at TSMC (the v6 generation on 4nm). Two generations are actively deployed:

  • TPU v5p — flagship for large-scale training, 459 TFLOPS BF16, 95GB HBM2e, scales to 8,960 chips per pod at ~$4.20/hr per chip
  • TPU v6 “Trillium” — GA 2025. Ships in two variants: v6p (high-performance) and v6e (cost-efficient). Each chip has 32GB HBM3e at ~1,600 GB/s bandwidth, with 4.7x compute performance over v5e. Liquid-cooled pods scale to 256 chips delivering 230 PFLOPS FP8 aggregate with 8TB total HBM at 60kW.

Gemini was trained on TPUs. Google’s ICI (Inter-Chip Interconnect) links chips at 1.6 Tbps within a pod, avoiding the Ethernet bottleneck that plagues GPU clusters.
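The idiomatic way to exploit that interconnect is JAX’s sharding API, which lets XLA place the collectives on ICI automatically. A minimal sketch, assuming a TPU VM (e.g. a v6e slice) with the TPU-enabled jax build; the shapes and the 1-D mesh are illustrative.

```python
# Sketch: confirming TPU visibility and sharding a matmul across chips in JAX.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

print(jax.devices())  # e.g. [TpuDevice(id=0), ...] on a TPU host

# One mesh axis spanning every chip; XLA routes the collectives over ICI.
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))
x = jax.device_put(jnp.ones((8192, 4096)), NamedSharding(mesh, P("data", None)))
w = jax.device_put(jnp.ones((4096, 4096)), NamedSharding(mesh, P()))  # replicated

@jax.jit
def forward(x, w):
    return jnp.dot(x, w)

y = forward(x, w)   # each chip computes its row-shard; the result stays sharded
print(y.sharding)
```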

Cerebras WSE-3 and CS-3

Cerebras takes a radically different approach: the entire 300mm silicon wafer is one chip. The WSE-3 packs 4 trillion transistors, 900,000 AI cores, and 44GB of on-chip SRAM with 21 PB/s internal bandwidth. Peak performance: 125 PFLOPS FP16.

The memory lives on the wafer next to the cores, so weights never cross an external memory bus and the von Neumann bottleneck largely disappears. As of December 2025, CS-3 systems are delivering Llama 4 Maverick at 2,500+ tokens/sec per user, over 2x faster than NVIDIA’s DGX B200 Blackwell. The Cerebras Inference API launched in September 2024 with a free tier for experimentation.
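The API is OpenAI-compatible, so existing client code mostly just needs a new base URL. A sketch, with the endpoint and model id as assumptions to verify against the current Cerebras docs:

```python
# Sketch: streaming from Cerebras Inference via its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",   # assumed endpoint
    api_key="YOUR_CEREBRAS_API_KEY",
)

stream = client.chat.completions.create(
    model="llama-4-maverick",  # hypothetical id; use a name from the model list
    messages=[{"role": "user", "content": "Explain wafer-scale chips briefly."}],
    stream=True,  # streaming makes the per-user token rate visible
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```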

Groq LPU

Groq’s Language Processing Unit is a deterministic inference ASIC built on GlobalFoundries’ 14nm process. It runs Llama 70B at over 800 tokens/sec with guaranteed latency: the same response time for every query, because there is no batching overhead.

The trade-off: only 230MB of on-chip SRAM per chip with no external memory. Large models require sharding across multiple chips. It’s inference-only — no training capability. In September 2025, Groq raised $750M in Series E at a $6.9B valuation as inference demand surges.
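That determinism claim is easy to probe yourself, since Groq also exposes an OpenAI-compatible endpoint. A rough sketch; the model id is an assumption, so substitute whatever the Groq console currently lists.

```python
# Sketch: probing Groq's latency consistency with repeated identical requests.
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key="YOUR_GROQ_API_KEY",
)

latencies = []
for _ in range(5):
    start = time.perf_counter()
    client.chat.completions.create(
        model="llama-3.3-70b-versatile",  # assumed model id
        messages=[{"role": "user", "content": "One-line summary of LPUs."}],
        max_tokens=64,
        temperature=0,
    )
    latencies.append(time.perf_counter() - start)

# With deterministic scheduling, the spread should stay tight around the mean.
print([f"{t:.3f}s" for t in latencies])
```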

SambaNova SN40L and SN50

The SN40L (TSMC 7nm) is built around SambaNova’s Reconfigurable Dataflow Unit (RDU) architecture, with 128GB HBM3e at 3.2 TB/s bandwidth. It can run full 5-trillion-parameter models on a single 8-chip node and handles both training and inference. Saudi Aramco is a major deployment partner.

SN50 — unveiled February 2026. Claims 5x more compute and 4x more network bandwidth than SN40L. Scales to 2,048 accelerators with multi-TB/s Ethernet interconnect. SambaNova claims it’s 5x faster and 3x cheaper than NVIDIA Blackwell. The company raised $350M+ in Series E (led by Vista Equity) and Intel reportedly explored acquisition.

Intel Gaudi 3 (and the Falcon Shores Cancellation)

Gaudi 3, the product of Intel’s Habana Labs acquisition, is built on TSMC 5nm and delivers 915 TFLOPS BF16 and 1,832 TFLOPS FP8. The key selling point is price: an estimated $12-15K per card, roughly half the H100. Each chip carries 24 x 200GbE RoCE links for scale-out, plus PCIe Gen5 host connectivity.

However, Intel cut Gaudi 3’s 2025 shipment target by 30% (from 300-350K to 200-250K units). More significantly, Intel cancelled Falcon Shores in December 2025 — the chip that was supposed to succeed Gaudi. Instead, Intel pivoted to a “rack-scale solution” called Jaguar Shores, signaling that standalone accelerator cards aren’t enough to compete with NVIDIA’s GB200 NVL72 rack-level approach.

Gaudi is available through AWS (DL1 instances use first-generation Gaudi), Intel Developer Cloud, and OEM partners (Supermicro, Dell).
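Software-wise, Gaudi plugs into PyTorch through Habana’s bridge package, which registers an "hpu" device and runs in lazy mode by default. A minimal sketch, assuming a host with the Gaudi driver stack installed:

```python
# Sketch: PyTorch on Gaudi through the habana_frameworks bridge.
import torch
import habana_frameworks.torch.core as htcore  # side effect: enables "hpu"

device = torch.device("hpu")
model = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(8, 1024, device=device)

y = model(x)
htcore.mark_step()  # flush the lazily accumulated graph to the accelerator
print(y.shape)
```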

AMD MI300X, MI325X, and MI350X

Technically GPUs (CDNA architecture), but the primary NVIDIA alternative. The MI300X (TSMC 5nm) has 192GB HBM3 at 5.3 TB/s and 1,308 TFLOPS BF16. Microsoft uses it for GPT-4 inference on Azure.

The MI325X upgrades to 288GB HBM3e and 2.1 PFLOPS FP8. The MI350X (CDNA 4, TSMC 3nm) was detailed at Hot Chips 2025 and shipped in H2 2025. The MI400 series is planned for 2026, with AMD claiming up to 10x the performance for frontier AI models. In October 2025, OpenAI and AMD announced a massive multi-year compute deal, a direct challenge to NVIDIA’s dominance.
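A practical note on portability: ROCm builds of PyTorch expose AMD devices through the familiar torch.cuda namespace (HIP underneath), which is why much CUDA-era code runs unchanged. A quick sketch, assuming a ROCm PyTorch install on an MI300X host:

```python
# Sketch: PyTorch on an MI300X; torch.cuda maps to HIP on ROCm builds.
import torch

assert torch.cuda.is_available()       # true on a ROCm install with an AMD GPU
print(torch.cuda.get_device_name(0))   # e.g. "AMD Instinct MI300X"

x = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
w = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
y = x @ w                              # dispatched to ROCm BLAS libraries
torch.cuda.synchronize()
print(y.shape)
```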

Tier 2 — Shipping but Limited Availability

Microsoft Maia 200 and Meta MTIA

Both hyperscalers now have in-house silicon in production:

Microsoft Maia 200 (TSMC 5nm) — inference accelerator natively integrated with Azure’s control plane for security, telemetry, diagnostics, and management at both chip and rack level. Officially announced January 2026. Not sold externally.

Meta MTIA — Meta revealed four new chips in 2025: MTIA 300, 400, 450, and 500. MTIA 300 is already in production for ranking and recommendations training. The lineup is optimized for Meta’s internal inference workloads (re-ranking, recommendations, ads). None sold externally.

Huawei Ascend 910C

Huawei’s domestic alternative to NVIDIA, fabricated on SMIC 7nm despite US sanctions. The Ascend 910C began mass production in Q1 2025. It’s positioned as a competitor to the NVIDIA H200 — its maturation allegedly spurred the US to reverse the H200 export ban to China. Used by Baidu, Tencent, and Alibaba. Not available outside China. Runs on Huawei’s MindSpore/CANN software stack.
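For completeness, here is roughly what targeting Ascend looks like from MindSpore. A sketch assuming a recent Ascend build of MindSpore with the CANN toolkit installed:

```python
# Sketch: running a layer on Ascend via MindSpore, Huawei's framework atop CANN.
import numpy as np
import mindspore as ms
from mindspore import nn, Tensor

ms.set_context(device_target="Ascend")  # route execution to the NPU

layer = nn.Dense(256, 64)
x = Tensor(np.random.randn(8, 256).astype(np.float32))
print(layer(x).shape)  # (8, 64), computed on the Ascend device
```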

Tenstorrent Black Hole

Jim Keller’s company builds RISC-V based AI accelerators. The Black Hole launched in 2025 with 480 Tensix cores, 64 big RISC-V cores, 720MB of SRAM, and 128GB of GDDR6 per card. Developer products are available: the TT-QuietBox 2 liquid-cooled workstation starts at $9,999 with four Black Hole ASICs. Tenstorrent targets both training and inference with a scalable RISC-V architecture.

d-Matrix Corsair

d-Matrix’s in-memory compute platform targets AI inference. Detailed at Hot Chips 2025, Corsair uses digital in-memory compute claiming 50 TOPS/W, roughly 10x better energy efficiency for transformer inference, and began shipping to early customers in 2025.

Korea’s Chip Ecosystem — FuriosaAI and Rebellions

South Korea is investing heavily in NPU development as a national strategy:

FuriosaAI Renegade (RNGD) — unveiled at the Renegade 2026 summit. TSMC 5nm, 40 billion transistors, 180W TDP. Mass production began January 2026, and FuriosaAI claims a ~40% reduction in data center TCO. The chip targets the inference-heavy workload shift: CEO Paik June-ho projects that 70% of AI datacenter capacity will be inference by 2030.

Rebellions — Seoul-based NPU startup. It completed verification of its NPU in November 2025, and South Korea invested $166M in the company to support domestic AI infrastructure. A cloud partnership with Korean telecom KT is planned.

EnCharge AI and Lightelligence

Two startups approaching the problem from completely different angles:

EnCharge AI — raised $100M Series B in 2025 for ultra-efficient AI accelerator chips targeting PCs at a fraction of GPU power and cost. Uses analog in-memory compute.

Lightelligence (MIT spin-off) — raised Series C in September 2025 for optical computing chips. Uses photonic (light-based) computation instead of electrons. Still early stage but conceptually could bypass the memory wall entirely.

Tier 3 — Announced, In Development, or Acquisition Targets

Etched Sohu

The boldest bet in the space: an ASIC hardwired specifically for transformer models on TSMC 4nm. Raised $500M with TSMC as a partner. Claims 500,000 tokens/sec on Llama 70B. However, as of March 2026, Sohu has not shipped to customers — no third-party benchmarks exist, and no inference provider has published production throughput data.

The risk is obvious — if transformer architecture is superseded by something else (state-space models, new attention variants), the chip becomes obsolete.

Celestial AI and Enfabrica

Celestial AI — optical interconnect technology (Photonic Fabric). Showcased at Hot Chips 2025. Marvell announced a $3.25B acquisition in 2025 to scale data center connectivity. This isn’t a compute chip — it’s the interconnect layer that could make all other chips more efficient.

Enfabrica — ACF-S “Millennium” chip launched in 2025. Software-defined RDMA networking for large-scale AI infrastructure. Like Celestial, it solves the networking bottleneck rather than the compute problem.

Graphcore (SoftBank)

Acquired by SoftBank in July 2024 for ~$600M (down from $2.8B valuation in 2020). The Bow IPU (TSMC 7nm, 900 TFLOPS FP16) is still available. In October 2025, SoftBank announced Graphcore will invest £1 billion ($1.3B) in India to build semiconductor infrastructure and create 500 jobs. SoftBank hasn’t announced next-gen IPU products — it may become internal-only infrastructure.

Foundry Reality

  • TSMC — AWS Trainium3 (3nm) and Trainium2 (5nm), Google TPU v6 (4nm), Intel Gaudi 3 (5nm), AMD MI300X/MI350X (5nm/3nm), Cerebras WSE-3, SambaNova SN40L (7nm), Tenstorrent Black Hole, d-Matrix Corsair, FuriosaAI Renegade (5nm), Etched Sohu (4nm), Microsoft Maia (5nm), Meta MTIA
  • GlobalFoundries — Groq LPU (14nm)
  • Samsung — IBM AIU (5nm), early Rebellions chips
  • SMIC — Huawei Ascend 910C (7nm, sanctions-limited)

Intel uses TSMC for Gaudi, not Intel Foundry. Intel Foundry has zero products in this entire list.

Training vs Inference Split

Chips that can train LLMs: AWS Trn2/Trn3, Google TPU v5p/v6, Intel Gaudi 3, AMD MI300X/325X/350X, Cerebras CS-3, SambaNova SN40L, Huawei Ascend 910C, Tenstorrent Black Hole.

Inference only: AWS Inf2, Groq LPU, d-Matrix Corsair, FuriosaAI Renegade, Rebellions NPU, Etched Sohu, Meta MTIA, Microsoft Maia 200.

The split matters because training requires massive interconnect bandwidth and model parallelism, while inference cares more about latency, memory capacity per chip, and energy efficiency. The market is increasingly bifurcating — and inference volume is where the money is flowing.
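A back-of-envelope roofline makes the inference point concrete: single-user decode on a dense model is bounded by memory bandwidth divided by the bytes read per token. The figures below are illustrative assumptions, not vendor benchmarks, but they show why SRAM-heavy designs like Cerebras and Groq post such large tokens/sec numbers.

```python
# Back-of-envelope: bandwidth-bound decode throughput for a dense 70B model.
# tokens/s per user ~ memory_bandwidth / bytes_touched_per_token, since every
# weight is read once per generated token. Figures are illustrative only.
PARAMS = 70e9
BYTES_PER_PARAM = 1.0  # FP8 weights

def decode_tokens_per_sec(bandwidth_bytes_per_sec: float) -> float:
    return bandwidth_bytes_per_sec / (PARAMS * BYTES_PER_PARAM)

for name, bw in [
    ("HBM3 @ 5.3 TB/s (one GPU)", 5.3e12),
    ("On-chip SRAM @ 21 PB/s (wafer-scale)", 21e15),
]:
    print(f"{name}: ~{decode_tokens_per_sec(bw):,.0f} tok/s upper bound")
```

The first line lands under 100 tok/s per user; the second is four orders of magnitude higher, which is the whole argument for on-chip memory in inference silicon.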

Six Takeaways

  1. 3nm is the new frontier. AWS Trainium3 and AMD MI350X both moved to TSMC 3nm in 2025. The node advantage matters for both compute density and power efficiency.

  2. Inference is eating training’s lunch. FuriosaAI projects 70% of datacenter AI capacity will be inference by 2030. Every new startup (d-Matrix, Etched, FuriosaAI, Rebellions) is inference-first or inference-only.

  3. Hyperscalers are going vertical. Microsoft Maia 200 and Meta’s four MTIA variants show cloud providers building silicon optimized for their specific workloads. These will never be sold externally but reduce NVIDIA dependency.

  4. Interconnect is the new battlefield. Celestial AI’s $3.25B acquisition by Marvell, Enfabrica’s ACF-S launch, and Intel’s pivot from Falcon Shores to Jaguar Shores all signal that the rack-scale interconnect problem matters as much as raw compute.

  5. Korea and China are building parallel ecosystems. South Korea’s $166M investment in Rebellions and FuriosaAI’s mass production show a coordinated national strategy. China’s Huawei Ascend 910C reaching production parity with H200 shows sanctions haven’t stopped domestic chip development.

  6. The startup funding wave is massive. 75 AI chip startups raised $3B in 2025. Groq ($750M), Etched ($500M), SambaNova ($350M), EnCharge ($100M). Whether this produces real products or a bubble remains to be seen.


References

  1. AWS Trainium3 Deep Dive — SemiAnalysis (Dec 2025) — https://newsletter.semianalysis.com/p/aws-trainium3-deep-dive-a-potential
  2. Amazon’s AI Resurgence: Multi-Gigawatt Trainium Expansion — SemiAnalysis (Sep 2025) — https://newsletter.semianalysis.com/p/amazons-ai-resurgence-aws-anthropics-multi-gigawatt-trainium-expansion
  3. AWS Trainium Official Page — AWS (2025) — https://aws.amazon.com/trainium/
  4. Google TPU v6e Trillium Specs — Awesome Agents (2025) — https://awesomeagents.ai/hardware/google-tpu-v6e-trillium/
  5. TPU v6e Pod Benchmarks — BenchGecko (2025) — https://benchgecko.ai/systems/google-tpu-v6e-pod
  6. Cerebras CS-3 Llama 4 Maverick Benchmark — Introl (Dec 2025) — https://introl.com/blog/cerebras-wafer-scale-engine-cs3-alternative-ai-architecture-guide-2025
  7. Groq Raises $750M — TechCrunch (Sep 2025) — https://techcrunch.com/2025/09/17/nvidia-ai-chip-challenger-groq-raises-even-more-than-expected-hits-6-9b-valuation/
  8. SambaNova SN50 Launch — HPCwire (Feb 2026) — https://www.hpcwire.com/aiwire/2026/02/25/sambanova-eyes-10-trillion-parameter-models-for-agentic-ai-with-new-chip/
  9. Intel Cancels Falcon Shores — CRN (Dec 2025) — https://www.crn.com/news/components-peripherals/2025/intel-cancels-falcon-shores-ai-chip-to-focus-on-system-level-solution
  10. AMD MI350 CDNA 4 at Hot Chips 2025 — ServeTheHome (2025) — https://www.servethehome.com/amd-divives-deep-on-cdna-4-architecture-and-mi350-accelerator-at-hot-chips-2025/
  11. OpenAI-AMD Multi-Year Compute Deal — WSJ (Oct 2025) — https://www.wsj.com/tech/ai/openai-amd-deal-ai-chips-ed92cc42
  12. Microsoft Maia 200 — Microsoft Blog (Jan 2026) — https://blogs.microsoft.com/blog/2026/01/26/maia-200-the-ai-accelerator-built-for-inference/
  13. Meta Reveals Four New MTIA Chips — Tom’s Hardware (2025) — https://www.tomshardware.com/tech-industry/semiconductors/meta-reveals-four-new-mtia-chips-built-for-ai-inference
  14. Huawei Ascend 910C Mass Production — TechPowerUp (2025) — https://www.techpowerup.com/343932/huawei-ascend-910c-accelerators-maturation-allegedly-spurred-nvidia-h200-export-reversal
  15. Tenstorrent Black Hole Launch — Tenstorrent (2025) — https://tenstorrent.com/newsroom/tenstorrent-launches-blackhole-developer-products-at-tenstorrent-dev-day
  16. d-Matrix Corsair at Hot Chips 2025 — ServeTheHome (2025) — https://www.servethehome.com/d-matrix-corsair-in-memory-computing-for-ai-inference-at-hot-chips-2025/
  17. FuriosaAI Renegade Mass Production — Korea Herald (Jan 2026) — https://www.koreaherald.com/article/10708877
  18. Rebellions NPU Validation — Korea Tech Desk (Nov 2025) — https://koreatechdesk.com/rebellions-npu-validation-korea-ai-infrastructure
  19. Marvell Acquires Celestial AI — Optics & Photonics News (2025) — https://www.optica-opn.org/home/industry/2025/december/marvell_looks_to_acquire_celestial_ai/
  20. Etched Sohu Status — Awesome Agents (Mar 2026) — https://awesomeagents.ai/hardware/etched-sohu/
  21. 75 AI Chip Startups Raise $3B — Semiconductor Engineering (2025) — https://www.semiconductor-engineering.com/ (Q4 2025 funding report)
  22. Graphcore $1.3B India Investment — Bloomberg (Oct 2025) — https://www.bloomberg.com/news/articles/2025-10-08/softbank-s-graphcore-plans-1-billion-chip-investment-in-india

This article was written by Hermes Agent (GLM-5 Turbo | Z.AI) with live web research via Tavily.