Updated April 2026 · Source-backed deep dive

Google TPUs vs NVIDIA & AMD GPUs

A rigorous 2026 comparison of AI accelerator platforms: architecture, benchmarks, software ecosystems, cloud pricing, and workload-driven procurement guidance for teams choosing between TPUs, NVIDIA, and AMD.

Google TPUs
NVIDIA Hopper & Blackwell
AMD CDNA3

Platforms Compared

Datacenter AI Accelerators · 2026

TPU v5e · v5p · v6e Trillium · Ironwood · TPU 8t · TPU 8i · H100 · H200 · B200 · MI300X · MI325X
Decision Framework

Training

Scale & cost

Inference

Latency & throughput

Software

Ecosystem fit

Output

Workload-matched accelerator choice

The Challenge

No single accelerator wins every workload.

  • Training and inference reward fundamentally different hardware traits.
  • Software stack lock-in can cost more than hardware price differences.
  • MLPerf doesn't yet cover all platforms symmetrically in 2026.
  • TPU 8t/8i are newly announced; no mature cross-vendor benchmarks exist yet.

The Answer

Match hardware to your workload, stack, and team.

  • Google TPUs: best for large-scale training and Google-stack workloads.
  • NVIDIA GPUs: broadest ecosystem, lowest migration friction, safest default.
  • AMD GPUs: memory-capacity leader with an improving open stack.
  • Operational fit and staff expertise frequently outweigh peak silicon specs.

Timeline

Platform progression shapes the 2026 landscape.

Accelerator choice in 2026 is shaped as much by software-stack maturity and availability timing as by raw silicon specs.

2023

TPU v4 ISCA paper published

Established benchmark for performance/watt gains (2.7× vs v3) and reconfigurable optical interconnects at scale.

2024

First public Trillium training benchmarks · AMD MI300X cloud deployments

MLPerf 4.1: Trillium delivers 1.8× better perf/dollar vs v5p, 99% scaling efficiency. AMD MI300X enters mainstream cloud with 192 GB HBM.

2025

Ironwood GA · H200 wide deployment · AMD MI325X (256 GB)

TPU7x Ironwood reaches GA. Google Cloud publishes H200 & B200 MLPerf inference results. AMD MI325X raises memory bar to 256 GB HBM3E.

April 2026 · Now

TorchTPU preview · TPU 8t & TPU 8i announced (April 22, 2026)

Native PyTorch on TPU enters preview. TPU 8t purpose-built for training (12.6 PFLOPS FP4, 216 GB). TPU 8i purpose-built for reasoning/serving (10.1 PFLOPS FP4, 288 GB, Boardfly topology).

Architecture

Three hardware families. Three design philosophies. Distinct tradeoffs.

TPUs are AI ASICs designed around matrix execution, HBM, and pod-level interconnects. GPUs are programmable parallel processors; their generality is both their strength and a source of efficiency overhead relative to domain-specific silicon.
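
As a concrete illustration of the compiler-centric TPU model, the minimal JAX sketch below (not drawn from any vendor example; the ffn_block function, shapes, and dtypes are arbitrary assumptions) hands an entire feed-forward block to XLA, which fuses and schedules it for whatever backend is attached: TPU MXUs, GPU tensor cores, or CPU.

```python
import jax
import jax.numpy as jnp

# jax.jit hands the whole function to XLA, which fuses and schedules it for
# the attached backend; on TPU the compiler, not the programmer, manages
# data movement between HBM and on-chip VMEM.
@jax.jit
def ffn_block(x, w1, w2):
    return jax.nn.relu(x @ w1) @ w2

key = jax.random.PRNGKey(0)
x  = jax.random.normal(key, (128, 1024), dtype=jnp.bfloat16)
w1 = jax.random.normal(key, (1024, 4096), dtype=jnp.bfloat16)
w2 = jax.random.normal(key, (4096, 1024), dtype=jnp.bfloat16)

print(ffn_block(x, w1, w2).shape, jax.devices()[0].platform)
```

The same source runs unchanged on NVIDIA or AMD GPUs with the corresponding JAX backend installed, which is part of why compiler-managed execution trades programmer control for portability.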

GPU System Model

NVIDIA / AMD

  • SM / CU + Tensor & Matrix Cores
  • Registers & Shared / L1 Cache
  • L2 / Last-Level Cache
  • HBM (device memory)
  • NVLink / Infinity Fabric
  • InfiniBand / Ethernet

TPU System Model

Google Cloud

  • TensorCore / MXU (matrix)
  • SparseCore (v5p, v6e, 8t)
  • On-chip SRAM / VMEM
  • HBM (compiler-managed, XLA)
  • ICI (Inter-Chip Interconnect)
  • OCS / Virgo / Boardfly

Key Design Split

What it means

  • TPU pods = cohesive supercomputers
  • GPU clusters = flexibly assembled nodes
  • TPU efficiency relies on XLA compiler
  • GPU efficiency relies on software stack
  • TPU 8t/8i: training vs. serving split
  • B200 spends watts on generality

Representative Platform Specifications

Platform | Chip | HBM | Peak BW | Scale-Up Interconnect | Max Pod Scale
Google TPU | v5e | 16 GB | 800 GiB/s | ICI | 256-chip slice
Google TPU | v6e Trillium | 32 GB | 1,638 GiB/s | ICI 800 GB/s | 256-chip 2D torus
Google TPU | v5p | 95 GiB | 2,575 GiB/s | ICI 1,200 GB/s | 3D torus · 6,144 chips
Google TPU | TPU7x Ironwood | 192 GiB HBM3E | 7,380 GiB/s | ICI 1,200 GB/s | 9,216-chip pod
Google TPU · NEW | TPU 8t | 216 GB + 128 MB SRAM | 6,528 GB/s | Torus + Virgo | 9,600-chip; 134K+ via Virgo
Google TPU · NEW | TPU 8i | 288 GB + 384 MB SRAM | 8,601 GB/s | Boardfly | 1,024 active chips/pod
NVIDIA GPU | H200 | 141 GB HBM3e | 4.8 TB/s | NVLink 900 GB/s+ | 8-GPU HGX node
NVIDIA GPU | HGX B200 | 180 GB / GPU | 62 TB/s (8-GPU) | 5th-gen NVLink/NVSwitch | 8-GPU HGX node
AMD GPU | MI300X | 192 GB HBM3 | 5.3 TB/s | 7× Infinity Fabric | 8-GPU OAM node
AMD GPU | MI325X | 256 GB HBM3E | 6.0 TB/s | 7× Infinity Fabric | 8-GPU OAM node

Published Peak Specs

Official vendor figures; no invented numbers.

TPU figures emphasize BF16/FP8/FP4 in line with Google's public framing. Where standalone FLOP tables were absent in official sources, no figures are invented.

TPU v6e Trillium

918 TFLOPS

BF16

32 GB HBM · 1,638 GiB/s

TPU7x Ironwood

2,307 TFLOPS

BF16

192 GiB HBM3E · 7,380 GiB/s

TPU 8t · NEW

12.6 PFLOPS

FP4

216 GB HBM · 6,528 GB/s

TPU 8i · NEW

10.1 PFLOPS

FP4

288 GB HBM · 8,601 GB/s

NVIDIA H200

>32 PFLOPS

FP8 · 8-GPU HGX

141 GB HBM3e · 4.8 TB/s

HGX B200

144 PFLOPS

FP4 · 8-GPU total

180 GB/GPU · 62 TB/s total

AMD MI300X

1,307 TFLOPS

BF16

192 GB HBM3 · 5.3 TB/s

AMD MI325X

1,307 TFLOPS

BF16

256 GB HBM3E · 6.0 TB/s

Performance

What published benchmarks show about real-world AI performance.

Training Scale

TPU 8t: 1M+ chips in one cluster

v5p scales to 6,144 chips (18,432 with Multislice). Ironwood reaches 9,216-chip pods. TPU 8t targets >1M chips in a single training cluster via JAX + Pathways, with Virgo linking 134K+ chips non-blocking.
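
The single-cluster scaling claim rests on data- and model-parallel sharding expressed at the framework level. The sketch below is a minimal, hypothetical illustration using public JAX sharding APIs (Mesh, NamedSharding); the mesh shape, array sizes, and axis names are assumptions, and a real Pathways/MaxText setup layers far more on top.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange all visible chips into a (data, model) mesh; on a real pod slice
# the mesh shape would be chosen to match the ICI torus topology.
devices = np.array(jax.devices()).reshape(-1, 1)
mesh = Mesh(devices, axis_names=("data", "model"))

# Shard activations along "data" and weights along "model"; XLA inserts the
# required collectives over ICI automatically. Assumes the batch dimension
# divides the number of chips on the "data" axis.
x = jax.device_put(jnp.ones((128, 4096)), NamedSharding(mesh, P("data", None)))
w = jax.device_put(jnp.ones((4096, 4096)), NamedSharding(mesh, P(None, "model")))

@jax.jit
def forward(x, w):
    return x @ w

print(forward(x, w).sharding)
```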

MLPerf Training

Trillium: 1.8× better $/perf vs v5p

MLPerf Training 4.1: 99% scaling efficiency. AMD MI300X ran Llama 2 70B LoRA in 29.6 min (1 node) down to 10.9 min (4 nodes), confirming that the ROCm/MI300X stack is real and scales.

Inference Throughput

1,703 tokens/s on Llama 3.1 405B

JetStream + Pathways on Trillium. Trillium exceeded v5e by 2.9× for Llama 2 70B. Disaggregated serving improved time-to-first-token by 7× and token-generation by nearly 3×.

NVIDIA H200

1.9× Llama 2 70B vs. H100

H200's 141 GB HBM3e positions it for large LLM inference. Google Cloud's A3 Ultra (8× H200) achieved results comparable to NVIDIA's peak GPU MLPerf Inference v5.0 submissions.

TPU 8i · Reasoning

5× lower collective latency

Collectives Acceleration Engine: 5× lower on-chip collective latency. Boardfly topology: up to 50% less all-to-all latency, network diameter shrinks from 16 to 7 hops at 1,024-chip scale.
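
For readers less familiar with the term, a "collective" here is a cross-chip reduction or exchange, like the JAX all-reduce sketched below. This is purely a definitional example; the TPU 8i offload described above happens below this API level, not in user code.

```python
import jax
import jax.numpy as jnp

def all_reduce_mean(x):
    # psum sums x across every device on the "chips" axis; this is the class
    # of operation whose latency the Boardfly topology and the Collectives
    # Acceleration Engine aim to cut.
    return jax.lax.psum(x, axis_name="chips") / jax.device_count()

per_device = jnp.ones((jax.local_device_count(), 4))
print(jax.pmap(all_reduce_mean, axis_name="chips")(per_device))
```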

Energy Efficiency

TPU 8t/8i: 2× perf/watt vs. Ironwood

TPU v4 achieved 2.7× perf/watt vs. v3 and was 1.2–1.7× faster than A100. All three vendors now operate in the 750–1,000 W TBP range at the high end but spend those watts differently.

Software Ecosystem

If you value ease of development under uncertainty, this section decides the purchase.

The Reality

CUDA's "mental cache" is deep and sticky.

Enterprise teams have years of CUDA assumptions baked into model code, profiling workflows, and deployment pipelines. CUDA, cuDNN, NCCL, TensorRT, and Nsight form the most mature AI software stack in existence. Migration friction is real and often dominates total cost of ownership.

The Shift

TPU and AMD gaps are narrowing fast.

TorchTPU (April 2026) enables Eager First PyTorch on TPUs. AMD ROCm now supports PyTorch, TensorFlow, Triton, JAX, and ONNX Runtime with RCCL, ROCProfiler, and ROCm Profiler. Both are promising but neither yet matches CUDA's operational depth for enterprise production.
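
The practical upshot for PyTorch teams fits in a few lines. The sketch below is an illustration under stated assumptions, not vendor sample code: ROCm builds of PyTorch expose AMD GPUs through the familiar torch.cuda API, while today's established TPU path goes through the separate torch_xla package (no TorchTPU preview API is assumed here).

```python
import torch

# ROCm builds of PyTorch surface AMD GPUs via the same torch.cuda interface,
# so CUDA-written model code usually runs unmodified; Cloud TPU currently
# requires the separate torch_xla package instead.
if torch.cuda.is_available():                  # NVIDIA CUDA or AMD ROCm build
    device = torch.device("cuda")
else:
    try:
        import torch_xla.core.xla_model as xm  # PyTorch/XLA path on Cloud TPU
        device = xm.xla_device()
    except ImportError:
        device = torch.device("cpu")

model = torch.nn.Linear(1024, 1024).to(device)
out = model(torch.randn(8, 1024, device=device))
print(out.shape, device)
```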

NVIDIA · CUDA Stack
Most mature & broadly compatible
  • CUDA + cuDNN + NCCL + TensorRT
  • Nsight Systems / Compute profiling
  • Widest third-party library support
  • Lowest migration risk for PyTorch teams
AMD · ROCm Stack
Substantially improved, open stack
  • PyTorch, TensorFlow, Triton, JAX, ONNX
  • RCCL for collectives; ROCProfiler suite
  • ROCm Compute & Systems Profiler
  • Open-source; improving quickly
Google TPU · JAX + TorchTPU
Two paths: established & emerging
  • JAX / XLA / Pathways / MaxText (mature)
  • TorchTPU "Eager First" (April 2026 preview)
  • XProf / TensorBoard TPU profiling
  • PyTorch/XLA remains most mature path

Cloud Pricing & TCO

A clear, structured path to understanding accelerator economics.

TPU prices are on-demand per chip-hour. GPU prices are Dynamic Workload Scheduler calendar-mode VM-hour prices for 8-GPU nodes. These are different commercial modes; use them as directional anchors, not final quotes.

Google Cloud TPU · per chip, on-demand

TPU v5e

us-central1

$1.20 / chip-hr

TPU v6e Trillium

us-east1 · best price-performance

$2.70 / chip-hr

TPU v5p

us-east1 / us-east5

$4.20 / chip-hr

TPU7x Ironwood

us-central1 · premium large-scale

$12.00 / chip-hr

8-chip Trillium ≈ $21.60/hr vs $41.60/hr for 8×H100

Google Cloud GPU VMs · per VM (8 GPUs), calendar mode

A3 High 8× H100

calendar mode

$41.60 / VM-hr

A3 Mega 8× H100

calendar mode

$44.00 / VM-hr

A3 Ultra 8× H200

calendar mode

$59.36 / VM-hr

A4 8× B200

calendar mode · substantial premium

$90.22 / VM-hr

B200 premium pays off only when Blackwell-class throughput is actually consumed

Illustrative Annual Cloud Cost Scenarios

Computed directly from official public hourly prices. Excludes storage, networking, and reservation discounts. Directional only.

Scenario | TPU v5e | TPU v6e | TPU v5p | Ironwood | 8×H100 | 8×H200 | 8×B200
8 accelerators · 1 year | $84k | $189k | $294k | $841k | $364k | $520k | $790k
256 accelerators · 6 weeks | $310k | $697k | $1.08M | $3.10M | $1.34M | $1.91M | $2.91M
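
The scenario figures can be reproduced directly from the hourly prices quoted above; the short sketch below does exactly that, with prices hard-coded from the cards, a 365-day year and a 42-day run assumed, and no reservation discounts applied.

```python
# Reproduce the scenario table from the public hourly prices quoted above.
# TPU prices are per chip-hour; GPU prices are per 8-GPU VM-hour.
tpu_chip_hr = {"TPU v5e": 1.20, "TPU v6e": 2.70, "TPU v5p": 4.20, "Ironwood": 12.00}
gpu_vm_hr = {"8xH100": 41.60, "8xH200": 59.36, "8xB200": 90.22}

def cost(hourly_price, units, hours):
    return hourly_price * units * hours

YEAR_H, SIX_WEEKS_H = 24 * 365, 24 * 42

for name, price in tpu_chip_hr.items():
    print(f"{name:10s} {cost(price, 8, YEAR_H):>12,.0f} {cost(price, 256, SIX_WEEKS_H):>12,.0f}")
for name, price in gpu_vm_hr.items():  # 8 GPUs = 1 VM, 256 GPUs = 32 VMs
    print(f"{name:10s} {cost(price, 1, YEAR_H):>12,.0f} {cost(price, 32, SIX_WEEKS_H):>12,.0f}")
```

Running it yields, for example, $84,096 for eight v5e chips over a year and $1,341,850 for 256 H100 GPUs over six weeks, matching the rounded values in the table.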

Workload Recommendations

A clear path from raw workload to the right hardware, with no guesswork.

This matrix weights operational fit almost as heavily as silicon capability. In real projects, migration risk and team expertise determine total cost and calendar time as much as FLOP counts do.

Frontier Pretraining
TPU v5p · Ironwood · TPU 8t
  • Pod-scale ICI/OCS eliminates communication bottlenecks at scale
  • Up to 1M+ chips in one cluster via JAX + Pathways
  • Requires willingness to standardize on JAX/XLA
Mainstream Training / Fine-tuning
Trillium (cost) · H100 (portability)
  • Trillium: ~$21.60/hr for 8 chips vs $41.60 for 8×H100
  • H100: universal enterprise baseline, easiest portability
  • Choose based on team's framework comfort level
Large-Memory Training / Inference
AMD MI300X · MI325X
  • 192 GB (MI300X) and 256 GB (MI325X) memory-capacity leaders
  • Valuable for large models, adapters, memory-bound inference
  • Requires more ecosystem qualification than CUDA
Production Inference (today)
NVIDIA H200 / B200
  • Deepest production inference tooling: TensorRT, vLLM, Triton
  • H200: 1.9× Llama 2 70B vs H100, 141 GB memory
  • Watch TPU 8i, purpose-built for this workload, as it matures
Mixed Research + Enterprise
NVIDIA (safest default)
  • Broadest framework, library, and custom kernel support
  • Most predictable developer experience for PyTorch teams
  • Lowest migration risk for mixed research/production estates
Embedding / Recommendation
Google TPUs (SparseCore)
  • SparseCore on v5p, v6e, Ironwood, and TPU 8t is purpose-built for sparse embedding
  • Google-specific system co-design most valuable here
  • XLA/JAX integration maximizes SparseCore efficiency

Open Questions & Limitations

Validate on your actual workload before committing capital.

  • No exact workload specified: all recommendations are workload-category guidance, not workload-specific benchmarks.
  • TPU 8t / 8i are newly announced (April 22, 2026) and not yet covered by mature cross-vendor benchmarks.
  • MLPerf is not fully symmetric: no single benchmark yet covers all current TPU, Hopper, Blackwell, and AMD CDNA configurations in the same scenario.
  • Public hardware prices are opaque: buyers procure through negotiated channels, so cloud prices are more reliable anchors than street-price speculation.
  • The strongest TPU inference evidence is vendor-run: valuable Google-run benchmarks, but not equivalent to vendor-neutral MLPerf submissions.

Primary Sources

All data comes from official vendor documentation, datasheets, and academic papers. No secondary estimates invented.