Google TPUs vs NVIDIA & AMD GPUs
A rigorous 2026 comparison of AI accelerator platforms: architecture, benchmarks, software ecosystems, cloud pricing, and workload-driven procurement guidance for teams choosing between Google TPUs, NVIDIA GPUs, and AMD GPUs.
Platforms Compared
Datacenter AI Accelerators · 2026
Training
Scale & cost
Inference
Latency & throughput
Software
Ecosystem fit
Output
Workload-matched accelerator choice
The Challenge
No single accelerator wins every workload.
- Training and inference reward fundamentally different hardware traits.
- Software-stack lock-in can cost more than hardware price differences.
- MLPerf doesn't yet cover all platforms symmetrically in 2026.
- TPU 8t/8i are newly announced; no mature cross-vendor benchmarks exist yet.
The Answer
Match hardware to your workload, stack, and team.
- Google TPUs: best for large-scale training and Google-stack workloads.
- NVIDIA GPUs: broadest ecosystem, lowest migration friction, safest default.
- AMD GPUs: memory-capacity leader with an improving open stack.
- Operational fit and staff expertise frequently outweigh peak silicon specs.
Timeline
Platform progression shapes the 2026 landscape.
Accelerator choice in 2026 is shaped as much by software-stack maturity and availability timing as by raw silicon specs.
2023
TPU v4 ISCA paper published
Established the reference point for performance-per-watt gains (2.7× vs v3) and for reconfigurable optical interconnects at scale.
2024
Trillium first public training benchmarks · AMD MI300X cloud deployments
MLPerf 4.1: Trillium delivers 1.8× better perf/dollar vs v5p, 99% scaling efficiency. AMD MI300X enters mainstream cloud with 192 GB HBM.
2025
Ironwood GA · H200 wide deployment · AMD MI325X (256 GB)
TPU7x Ironwood reaches GA. Google Cloud publishes H200 & B200 MLPerf inference results. AMD MI325X raises memory bar to 256 GB HBM3E.
April 2026 · Now
TorchTPU preview · TPU 8t & TPU 8i announced (April 22, 2026)
Native PyTorch on TPU enters preview. TPU 8t purpose-built for training (12.6 PFLOPS FP4, 216 GB). TPU 8i purpose-built for reasoning/serving (10.1 PFLOPS FP4, 288 GB, Boardfly topology).
Architecture
One platform. Three hardware families. Complete tradeoffs.
TPUs are AI ASICs designed around matrix execution, HBM, and pod-level interconnects. GPUs are programmable parallel processors; their generality is both their strength and a source of efficiency overhead relative to domain-specific silicon.
GPU System Model
NVIDIA / AMD
- SM / CU + Tensor & Matrix Cores
- Registers & Shared / L1 Cache
- L2 / Last-Level Cache
- HBM (device memory)
- NVLink / Infinity Fabric
- InfiniBand / Ethernet
TPU System Model
Google Cloud
- TensorCore / MXU (matrix)
- SparseCore (v5p, v6e, 8t)
- On-chip SRAM / VMEM
- HBM (compiler-managed, XLA)
- ICI (Inter-Chip Interconnect)
- OCS / Virgo / Boardfly
Key Design Split
What it means
- TPU pods = cohesive supercomputers
- GPU clusters = flexible assemblies of nodes
- TPU efficiency relies on XLA compiler
- GPU efficiency relies on software stack
- TPU 8t/8i: training vs. serving split
- B200 spends watts on generality
Representative Platform Specifications
| Platform | Chip | HBM | Peak HBM BW | Scale-Up Interconnect | Max Pod Scale |
|---|---|---|---|---|---|
| Google TPU | v5e | 16 GB | 800 GiB/s | ICI | 256-chip slice |
| Google TPU | v6e Trillium | 32 GB | 1,638 GiB/s | ICI 800 GB/s | 256-chip 2D torus |
| Google TPU | v5p | 95 GB | 2,575 GiB/s | ICI 1,200 GB/s | 3D torus · 6,144 chips |
| Google TPU | TPU7x Ironwood | 192 GB HBM3E | 7,380 GiB/s | ICI 1,200 GB/s | 9,216-chip pod |
| Google TPU · new | TPU 8t | 216 GB + 128 MB SRAM | 6,528 GB/s | Torus + Virgo | 9,600-chip; 134K+ via Virgo |
| Google TPU · new | TPU 8i | 288 GB + 384 MB SRAM | 8,601 GB/s | Boardfly | 1,024 active chips/pod |
| NVIDIA GPU | H200 | 141 GB HBM3e | 4.8 TB/s | NVLink 900 GB/s+ | 8-GPU HGX node |
| NVIDIA GPU | HGX B200 | 180 GB / GPU | 62 TB/s (8-GPU total) | 5th-gen NVLink/NVSwitch | 8-GPU HGX node |
| AMD GPU | MI300X | 192 GB HBM3 | 5.3 TB/s | 7× Infinity Fabric | 8-GPU OAM node |
| AMD GPU | MI325X | 256 GB HBM3E | 6.0 TB/s | 7× Infinity Fabric | 8-GPU OAM node |
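To make the scale-up columns concrete, the sketch below is a minimal illustration (plain Python; chip counts and per-chip HBM taken from the rows above) of how much HBM sits inside a single directly interconnected domain on each platform. It is a reading aid, not an official vendor figure.

```python
# Aggregate HBM per scale-up domain, using chip counts and per-chip HBM
# from the table above. Illustrative only; not an official vendor figure.
scale_up_domains = {
    "TPU v6e Trillium, 256-chip slice": (256, 32),    # (chips, GB HBM per chip)
    "TPU7x Ironwood, 9,216-chip pod":   (9_216, 192),
    "NVIDIA HGX H200, 8-GPU node":      (8, 141),
    "NVIDIA HGX B200, 8-GPU node":      (8, 180),
    "AMD MI325X, 8-GPU OAM node":       (8, 256),
}

for name, (chips, hbm_gb) in scale_up_domains.items():
    total_tb = chips * hbm_gb / 1_000
    print(f"{name:34s} ~{total_tb:,.1f} TB HBM in one scale-up domain")
```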
Published Peak Specs
Official vendor figures only; no invented numbers.
TPU figures emphasize BF16/FP8/FP4 in line with Google's public framing. Where standalone FLOP tables were absent in official sources, no figures are invented.
| Platform | Peak Compute | Memory · Bandwidth |
|---|---|---|
| TPU v6e Trillium | 918 TFLOPS BF16 | 32 GB HBM · 1,638 GiB/s |
| TPU7x Ironwood | 2,307 TFLOPS BF16 | 192 GB HBM3E · 7,380 GiB/s |
| TPU 8t | 12.6 PFLOPS FP4 | 216 GB HBM · 6,528 GB/s |
| TPU 8i | 10.1 PFLOPS FP4 | 288 GB HBM · 8,601 GB/s |
| NVIDIA H200 | >32 PFLOPS FP8 (8-GPU HGX) | 141 GB HBM3e · 4.8 TB/s per GPU |
| NVIDIA HGX B200 | 144 PFLOPS FP4 (8-GPU total) | 180 GB / GPU · 62 TB/s total |
| AMD MI300X | 1,307 TFLOPS BF16 | 192 GB HBM3 · 5.3 TB/s |
| AMD MI325X | 1,307 TFLOPS BF16 | 256 GB HBM3E · 6.0 TB/s |
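One hedged way to read these peak figures together is a roofline-style "machine balance": peak FLOPS divided by HBM bandwidth, which suggests how quickly a workload becomes memory-bound rather than compute-bound. The sketch below uses only the figures above; the rows mix BF16 and FP4 precisions and the GiB-to-GB conversion factor is my own, so treat the output as directional.

```python
# Roofline-style machine balance (peak FLOPs per byte of HBM bandwidth),
# computed from the published figures above. Rows mix BF16 and FP4 peaks,
# so this is a directional comparison only.
GIB_TO_GB = 1.073741824  # unit conversion, not a vendor figure

specs = {
    # name: (peak TFLOPS, precision, HBM bandwidth in GB/s)
    "TPU v6e Trillium": (918,    "BF16", 1_638 * GIB_TO_GB),
    "TPU7x Ironwood":   (2_307,  "BF16", 7_380 * GIB_TO_GB),
    "TPU 8t":           (12_600, "FP4",  6_528),
    "TPU 8i":           (10_100, "FP4",  8_601),
    "AMD MI300X":       (1_307,  "BF16", 5_300),
    "AMD MI325X":       (1_307,  "BF16", 6_000),
}

for name, (tflops, precision, bw_gb_s) in specs.items():
    flops_per_byte = (tflops * 1e12) / (bw_gb_s * 1e9)
    print(f"{name:18s} {precision:4s} ~{flops_per_byte:5.0f} FLOPs per HBM byte")
```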
Performance
What published benchmarks show for training, inference, and scaling.
TPU 8t: 1M+ chips in one cluster
v5p scales to 6,144 chips (18,432 with Multislice). Ironwood reaches 9,216-chip pods. TPU 8t targets >1M chips in a single training cluster via JAX + Pathways, with Virgo linking 134K+ chips non-blocking.
Trillium: 1.8× better $/perf vs v5p
MLPerf Training 4.1: Trillium posted 99% scaling efficiency. AMD MI300X ran Llama 2 70B LoRA in 29.6 min (1 node) down to 10.9 min (4 nodes), confirming that ROCm on MI300X is real and scales.
1,703 tokens/s on Llama 3.1 405B
JetStream + Pathways on Trillium. Trillium exceeded v5e by 2.9× for Llama 2 70B. Disaggregated serving improved time-to-first-token by 7× and token-generation throughput by nearly 3×.
1.9× Llama 2 70B vs. H100
H200's 141 GB HBM3e positions it for large LLM inference. Google Cloud's A3 Ultra (8× H200) achieved results comparable to NVIDIA's peak GPU MLPerf Inference v5.0 submissions.
5× lower collective latency
Collectives Acceleration Engine: 5× lower on-chip collective latency. Boardfly topology: up to 50% lower all-to-all latency; network diameter shrinks from 16 to 7 hops at 1,024-chip scale.
TPU 8t/8i: 2× perf/watt vs. Ironwood
TPU v4 achieved 2.7× perf/watt vs. v3 and was 1.2–1.7× faster than A100. All three vendors now operate in the 750–1,000 W TBP range at the high end but spend those watts differently.
Software Ecosystem
If you value ease of development, this section often decides the purchase.
CUDA's "mental cache" is deep and sticky.
Enterprise teams have years of CUDA assumptions baked into model code, profiling workflows, and deployment pipelines. CUDA, cuDNN, NCCL, TensorRT, and Nsight form the most mature AI software stack in existence. Migration friction is real and often dominates total cost of ownership.
TPU and AMD gaps are narrowing fast.
TorchTPU (April 2026) enables Eager First PyTorch on TPUs. AMD ROCm now supports PyTorch, TensorFlow, Triton, JAX, and ONNX Runtime with RCCL, ROCProfiler, and ROCm Profiler. Both are promising but neither yet matches CUDA's operational depth for enterprise production.
NVIDIA · CUDA
- CUDA + cuDNN + NCCL + TensorRT
- Nsight Systems / Compute profiling
- Widest third-party library support
- Lowest migration risk for PyTorch teams
AMD · ROCm
- PyTorch, TensorFlow, Triton, JAX, ONNX
- RCCL for collectives; ROCProfiler suite
- ROCm Compute & Systems Profiler
- Open-source; improving quickly
Google TPU
- JAX / XLA / Pathways / MaxText (mature)
- TorchTPU "Eager First" (April 2026 preview)
- XProf / TensorBoard TPU profiling
- PyTorch/XLA remains most mature path
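As a concrete anchor for the JAX/XLA path listed above, here is a minimal, hedged sketch (the attention_scores function is an illustrative toy, not code from any of the cited stacks): the same jit-compiled function runs unchanged on CPU, GPU, or TPU backends, with XLA handling layout and memory placement.

```python
# Minimal JAX/XLA sketch: one jit-compiled function, any attached backend.
import jax
import jax.numpy as jnp

@jax.jit  # traced once, then compiled by XLA for the available backend
def attention_scores(q, k):
    # Toy scaled dot-product scores, just to give XLA something to fuse.
    return jax.nn.softmax(q @ k.T / jnp.sqrt(q.shape[-1]))

q = jnp.ones((128, 64), dtype=jnp.bfloat16)
k = jnp.ones((128, 64), dtype=jnp.bfloat16)

print(jax.devices())                    # e.g. [TpuDevice(...)] on a TPU VM
print(attention_scores(q, k).shape)     # (128, 128)
```

The point of the sketch is the deployment model: on TPUs, efficiency comes from this compile step rather than from hand-tuned kernels, which is exactly the tradeoff the design-split list above describes.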
Cloud Pricing & TCO
A clear, structured path to understanding accelerator economics.
TPU prices are on-demand per chip-hour. GPU prices are Dynamic Workload Scheduler calendar-mode VM-hour prices for 8-GPU nodes. These are different commercial modes; use them as directional anchors, not final quotes.
Google Cloud TPU · per chip-hour, on-demand
TPU v5e
us-central1
$1.20 / chip-hr
TPU v6e Trillium
us-east1 · best price-performance
$2.70 / chip-hr
TPU v5p
us-east1 / us-east5
$4.20 / chip-hr
TPU7x Ironwood
us-central1 · premium large-scale
$12.00 / chip-hr
8-chip Trillium ≈ $21.60/hr vs $41.60/hr for 8×H100
Google Cloud GPU VMs · per VM-hour (8 GPUs), calendar mode
A3 High 8× H100
calendar mode
$41.60 / VM-hr
A3 Mega 8× H100
calendar mode
$44.00 / VM-hr
A3 Ultra 8× H200
calendar mode
$59.36 / VM-hr
A4 8× B200
calendar mode · substantial premium
$90.22 / VM-hr
B200 premium pays off only when Blackwell-class throughput is actually consumed
Illustrative Annual Cloud Cost Scenarios
Computed directly from official public hourly prices. Excludes storage, networking, and reservation discounts. Directional only.
| Scenario | TPU v5e | TPU v6e | TPU v5p | Ironwood | 8×H100 | 8×H200 | 8×B200 |
|---|---|---|---|---|---|---|---|
| 8 accelerators · 1 year | $84k | $189k | $294k | $841k | $364k | $520k | $790k |
| 256 accelerators · 6 weeks | $310k | $697k | $1.08M | $3.10M | $1.34M | $1.91M | $2.91M |
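The scenario figures follow directly from the hourly prices above. The sketch below shows the arithmetic under the stated assumptions (8,760 hours per year, 1,008 hours for six weeks, one VM per 8 GPUs, no discounts); it reproduces a few of the table's cells rather than adding new data.

```python
# Reproduce the illustrative scenarios from the on-demand / calendar-mode
# hourly prices listed above. GPU prices are per 8-GPU VM, so 256 GPUs map
# to 32 VMs. Excludes storage, networking, and reservation discounts.
HOURS_PER_YEAR = 24 * 365          # 8,760
HOURS_SIX_WEEKS = 24 * 7 * 6       # 1,008

def tpu_cost(chip_hr, chips, hours):
    return chip_hr * chips * hours

def gpu_cost(vm_hr, gpus, hours):
    return vm_hr * (gpus // 8) * hours   # one VM per 8 GPUs

# Scenario: 8 accelerators for one year
print(f"TPU v6e, 8 chips, 1 year:   ${tpu_cost(2.70, 8, HOURS_PER_YEAR):,.0f}")    # ~$189k
print(f"8xH100 VM, 1 year:          ${gpu_cost(41.60, 8, HOURS_PER_YEAR):,.0f}")   # ~$364k

# Scenario: 256 accelerators for six weeks
print(f"TPU v5p, 256 chips, 6 wks:  ${tpu_cost(4.20, 256, HOURS_SIX_WEEKS):,.0f}")   # ~$1.08M
print(f"8xB200 x 32 VMs, 6 wks:     ${gpu_cost(90.22, 256, HOURS_SIX_WEEKS):,.0f}")  # ~$2.91M
```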
Workload Recommendations
A clear path from raw workload to the right hardware, with no guesswork.
This matrix weights operational fit almost as heavily as silicon capability. In real projects, migration risk and team expertise determine total cost and calendar time as much as FLOP counts do.
- Pod-scale ICI/OCS eliminates communication bottlenecks at scale
- Up to 1M+ chips in one cluster via JAX + Pathways
- Requires willingness to standardize on JAX/XLA
- Trillium: ~$21.60/hr for 8 chips vs $41.60 for 8×H100
- H100: universal enterprise baseline, easiest portability
- Choose based on the team's framework comfort level
- 192 GB (MI300X) and 256 GB (MI325X): memory-capacity leaders
- Valuable for large models, adapters, and memory-bound inference
- Requires more ecosystem qualification than CUDA
- Deepest production inference tooling: TensorRT, vLLM, Triton
- H200: 1.9× Llama 2 70B inference vs H100, 141 GB memory
- Watch TPU 8i, purpose-built for this, as it matures
- Broadest framework, library, and custom-kernel support
- Most predictable developer experience for PyTorch teams
- Lowest migration risk for mixed research/production estates
- SparseCore on v5p, v6e, Ironwood, and TPU 8t is purpose-built for sparse embeddings
- Google-specific system co-design is most valuable here
- XLA/JAX integration maximizes SparseCore efficiency
Open Questions & Limitations
Validate on your actual workload before committing capital.
- No exact workload specified: all recommendations are workload-category guidance, not workload-specific benchmarks.
- TPU 8t / 8i are newly announced (April 22, 2026) and not yet covered by mature cross-vendor benchmarks.
- MLPerf is not fully symmetric: no single benchmark yet covers all current TPU, Hopper, Blackwell, and AMD CDNA configurations in the same scenario.
- Public hardware prices are opaque: buyers procure through negotiated channels, so cloud prices are more reliable anchors than street-price speculation.
- The strongest TPU inference evidence is vendor-run: valuable Google-run benchmarks, but not equivalent to vendor-neutral MLPerf submissions.
Primary Sources
All data comes from official vendor documentation, datasheets, and academic papers. No secondary estimates are invented.