Google TPUs vs NVIDIA & AMD GPUs
A rigorous 2026 comparison of AI accelerator platforms: architecture, benchmarks, software ecosystems, cloud pricing, and workload-driven procurement guidance for teams choosing between Google TPUs, NVIDIA GPUs, and AMD GPUs.
Platforms Compared
Datacenter AI Accelerators · 2026
Training
Scale & cost
Inference
Latency & throughput
Software
Ecosystem fit
Output
Workload-matched accelerator choice
The Challenge
No single accelerator wins every workload.
- Training and inference reward fundamentally different hardware traits.
- Software-stack lock-in can cost more than hardware price differences.
- MLPerf doesn't yet cover all platforms symmetrically in 2026.
- TPU 8t/8i are newly announced; no mature cross-vendor benchmarks exist yet.
The Answer
Match hardware to your workload, stack, and team.
- Google TPUs: best for large-scale training and Google-stack workloads.
- NVIDIA GPUs: broadest ecosystem, lowest migration friction, safest default.
- AMD GPUs: memory-capacity leader with an improving open stack.
- Operational fit and staff expertise frequently outweigh peak silicon specs.
Timeline
Platform progression shapes the 2026 landscape.
Accelerator choice in 2026 is shaped as much by software-stack maturity and availability timing as by raw silicon specs.
2023
TPU v4 ISCA paper published
Established the reference point for performance-per-watt gains (2.7× vs v3) and for reconfigurable optical interconnects at scale.
2024
Trillium first public training benchmarks · AMD MI300X cloud deployments
MLPerf 4.1: Trillium delivers 1.8× better perf/dollar vs v5p, 99% scaling efficiency. AMD MI300X enters mainstream cloud with 192 GB HBM.
2025
Ironwood GA · H200 wide deployment · AMD MI325X (256 GB)
TPU7x Ironwood reaches GA. Google Cloud publishes H200 & B200 MLPerf inference results. AMD MI325X raises memory bar to 256 GB HBM3E.
April 2026 · Now
TorchTPU preview · TPU 8t & TPU 8i announced (April 22, 2026)
Native PyTorch on TPU enters preview. TPU 8t purpose-built for training (12.6 PFLOPS FP4, 216 GB). TPU 8i purpose-built for reasoning/serving (10.1 PFLOPS FP4, 288 GB, Boardfly topology).
Architecture
One platform. Three hardware families. Complete tradeoffs.
TPUs are AI ASICs designed around matrix execution, HBM, and pod-level interconnects. GPUs are programmable parallel processors; their generality is both their strength and a source of efficiency overhead relative to domain-specific silicon.
GPU System Model
NVIDIA / AMD
- SM / CU + Tensor & Matrix Cores
- Registers & Shared / L1 Cache
- L2 / Last-Level Cache
- HBM (device memory)
- NVLink / Infinity Fabric
- InfiniBand / Ethernet
TPU System Model
Google Cloud
- TensorCore / MXU (matrix)
- SparseCore (v5p, v6e, 8t)
- On-chip SRAM / VMEM
- HBM (compiler-managed, XLA)
- ICI (Inter-Chip Interconnect)
- OCS / Virgo / Boardfly
Key Design Split
What it means
- TPU pods = cohesive supercomputers
- GPU clusters = flexible assemblies of nodes
- TPU efficiency relies on XLA compiler
- GPU efficiency relies on software stack
- TPU 8t/8i: training vs. serving split
- B200 spends watts on generality
Representative Platform Specifications
| Platform | Chip | HBM | Peak HBM BW | Scale-Up Interconnect | Max Pod Scale |
|---|---|---|---|---|---|
| Google TPU | v5e | 16 GB | 800 GiB/s | ICI | 256-chip slice |
| Google TPU | v6e Trillium | 32 GB | 1,638 GiB/s | ICI 800 GB/s | 256-chip 2D torus |
| Google TPU | v5p | 95 GB | 2,575 GiB/s | ICI 1,200 GB/s | 3D torus · 6,144 chips |
| Google TPU | TPU7x Ironwood | 192 GB HBM3E | 7,380 GiB/s | ICI 1,200 GB/s | 9,216-chip pod |
| Google TPU · new | TPU 8t | 216 GB + 128 MB SRAM | 6,528 GB/s | Torus + Virgo | 9,600-chip; 134K+ via Virgo |
| Google TPU · new | TPU 8i | 288 GB + 384 MB SRAM | 8,601 GB/s | Boardfly | 1,024 active chips/pod |
| NVIDIA GPU | H200 | 141 GB HBM3e | 4.8 TB/s | NVLink 900 GB/s+ | 8-GPU HGX node |
| NVIDIA GPU | HGX B200 | 180 GB / GPU | 62 TB/s (8-GPU total) | 5th-gen NVLink/NVSwitch | 8-GPU HGX node |
| AMD GPU | MI300X | 192 GB HBM3 | 5.3 TB/s | 7× Infinity Fabric | 8-GPU OAM node |
| AMD GPU | MI325X | 256 GB HBM3E | 6.0 TB/s | 7× Infinity Fabric | 8-GPU OAM node |
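To make the scale-up columns concrete, the sketch below is a minimal illustration (plain Python; chip counts and per-chip HBM taken from the rows above) of how much HBM sits inside a single directly interconnected domain on each platform. It is a reading aid, not an official vendor figure.

```python
# Aggregate HBM per scale-up domain, using chip counts and per-chip HBM
# from the table above. Illustrative only; not an official vendor figure.
scale_up_domains = {
    "TPU v6e Trillium, 256-chip slice": (256, 32),    # (chips, GB HBM per chip)
    "TPU7x Ironwood, 9,216-chip pod":   (9_216, 192),
    "NVIDIA HGX H200, 8-GPU node":      (8, 141),
    "NVIDIA HGX B200, 8-GPU node":      (8, 180),
    "AMD MI325X, 8-GPU OAM node":       (8, 256),
}

for name, (chips, hbm_gb) in scale_up_domains.items():
    total_tb = chips * hbm_gb / 1_000
    print(f"{name:34s} ~{total_tb:,.1f} TB HBM in one scale-up domain")
```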
Published Peak Specs
Official vendor figures only; no invented numbers.
TPU figures emphasize BF16/FP8/FP4 in line with Google's public framing. Where standalone FLOP tables were absent in official sources, no figures are invented.
| Platform | Peak Compute | Memory · Bandwidth |
|---|---|---|
| TPU v6e Trillium | 918 TFLOPS BF16 | 32 GB HBM · 1,638 GiB/s |
| TPU7x Ironwood | 2,307 TFLOPS BF16 | 192 GB HBM3E · 7,380 GiB/s |
| TPU 8t | 12.6 PFLOPS FP4 | 216 GB HBM · 6,528 GB/s |
| TPU 8i | 10.1 PFLOPS FP4 | 288 GB HBM · 8,601 GB/s |
| NVIDIA H200 | >32 PFLOPS FP8 (8-GPU HGX) | 141 GB HBM3e · 4.8 TB/s per GPU |
| NVIDIA HGX B200 | 144 PFLOPS FP4 (8-GPU total) | 180 GB / GPU · 62 TB/s total |
| AMD MI300X | 1,307 TFLOPS BF16 | 192 GB HBM3 · 5.3 TB/s |
| AMD MI325X | 1,307 TFLOPS BF16 | 256 GB HBM3E · 6.0 TB/s |
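One hedged way to read these peak figures together is a roofline-style "machine balance": peak FLOPS divided by HBM bandwidth, which suggests how quickly a workload becomes memory-bound rather than compute-bound. The sketch below uses only the figures above; the rows mix BF16 and FP4 precisions and the GiB-to-GB conversion factor is my own, so treat the output as directional.

```python
# Roofline-style machine balance (peak FLOPs per byte of HBM bandwidth),
# computed from the published figures above. Rows mix BF16 and FP4 peaks,
# so this is a directional comparison only.
GIB_TO_GB = 1.073741824  # unit conversion, not a vendor figure

specs = {
    # name: (peak TFLOPS, precision, HBM bandwidth in GB/s)
    "TPU v6e Trillium": (918,    "BF16", 1_638 * GIB_TO_GB),
    "TPU7x Ironwood":   (2_307,  "BF16", 7_380 * GIB_TO_GB),
    "TPU 8t":           (12_600, "FP4",  6_528),
    "TPU 8i":           (10_100, "FP4",  8_601),
    "AMD MI300X":       (1_307,  "BF16", 5_300),
    "AMD MI325X":       (1_307,  "BF16", 6_000),
}

for name, (tflops, precision, bw_gb_s) in specs.items():
    flops_per_byte = (tflops * 1e12) / (bw_gb_s * 1e9)
    print(f"{name:18s} {precision:4s} ~{flops_per_byte:5.0f} FLOPs per HBM byte")
```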
Performance
What published benchmarks show for training, inference, and scaling.
TPU 8t: 1M+ chips in one cluster
v5p scales to 6,144 chips (18,432 with Multislice). Ironwood reaches 9,216-chip pods. TPU 8t targets >1M chips in a single training cluster via JAX + Pathways, with Virgo linking 134K+ chips non-blocking.
Trillium: 1.8× better $/perf vs v5p
MLPerf Training 4.1: Trillium posted 99% scaling efficiency. AMD MI300X ran Llama 2 70B LoRA in 29.6 min (1 node) down to 10.9 min (4 nodes), confirming that ROCm on MI300X is real and scales.
1,703 tokens/s on Llama 3.1 405B
JetStream + Pathways on Trillium. Trillium exceeded v5e by 2.9× for Llama 2 70B. Disaggregated serving improved time-to-first-token by 7× and token-generation throughput by nearly 3×.
1.9× Llama 2 70B vs. H100
H200's 141 GB HBM3e positions it for large LLM inference. Google Cloud's A3 Ultra (8× H200) achieved results comparable to NVIDIA's peak GPU MLPerf Inference v5.0 submissions.
5× lower collective latency
Collectives Acceleration Engine: 5× lower on-chip collective latency. Boardfly topology: up to 50% lower all-to-all latency; network diameter shrinks from 16 to 7 hops at 1,024-chip scale.
TPU 8t/8i: 2× perf/watt vs. Ironwood
TPU v4 achieved 2.7× perf/watt vs. v3 and was 1.2–1.7× faster than A100. All three vendors now operate in the 750–1,000 W TBP range at the high end but spend those watts differently.
Software Ecosystem
If you value ease of development, this section often decides the purchase.
CUDA's "mental cache" is deep and sticky.
Enterprise teams have years of CUDA assumptions baked into model code, profiling workflows, and deployment pipelines. CUDA, cuDNN, NCCL, TensorRT, and Nsight form the most mature AI software stack in existence. Migration friction is real and often dominates total cost of ownership.
TPU and AMD gaps are narrowing fast.
TorchTPU (April 2026) enables Eager First PyTorch on TPUs. AMD ROCm now supports PyTorch, TensorFlow, Triton, JAX, and ONNX Runtime with RCCL, ROCProfiler, and ROCm Profiler. Both are promising but neither yet matches CUDA's operational depth for enterprise production.
NVIDIA · CUDA
- CUDA + cuDNN + NCCL + TensorRT
- Nsight Systems / Compute profiling
- Widest third-party library support
- Lowest migration risk for PyTorch teams
AMD · ROCm
- PyTorch, TensorFlow, Triton, JAX, ONNX
- RCCL for collectives; ROCProfiler suite
- ROCm Compute & Systems Profiler
- Open-source; improving quickly
Google TPU
- JAX / XLA / Pathways / MaxText (mature)
- TorchTPU "Eager First" (April 2026 preview)
- XProf / TensorBoard TPU profiling
- PyTorch/XLA remains most mature path
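As a concrete anchor for the JAX/XLA path listed above, here is a minimal, hedged sketch (the attention_scores function is an illustrative toy, not code from any of the cited stacks): the same jit-compiled function runs unchanged on CPU, GPU, or TPU backends, with XLA handling layout and memory placement.

```python
# Minimal JAX/XLA sketch: one jit-compiled function, any attached backend.
import jax
import jax.numpy as jnp

@jax.jit  # traced once, then compiled by XLA for the available backend
def attention_scores(q, k):
    # Toy scaled dot-product scores, just to give XLA something to fuse.
    return jax.nn.softmax(q @ k.T / jnp.sqrt(q.shape[-1]))

q = jnp.ones((128, 64), dtype=jnp.bfloat16)
k = jnp.ones((128, 64), dtype=jnp.bfloat16)

print(jax.devices())                    # e.g. [TpuDevice(...)] on a TPU VM
print(attention_scores(q, k).shape)     # (128, 128)
```

The point of the sketch is the deployment model: on TPUs, efficiency comes from this compile step rather than from hand-tuned kernels, which is exactly the tradeoff the design-split list above describes.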
Cloud Pricing & TCO
A clear, structured path to understanding accelerator economics.
TPU prices are on-demand per chip-hour. GPU prices are Dynamic Workload Scheduler calendar-mode VM-hour prices for 8-GPU nodes. These are different commercial modes; use them as directional anchors, not final quotes.
Google Cloud TPU · per chip-hour, on-demand
TPU v5e
us-central1
$1.20 / chip-hr
TPU v6e Trillium
us-east1 · best price-performance
$2.70 / chip-hr
TPU v5p
us-east1 / us-east5
$4.20 / chip-hr
TPU7x Ironwood
us-central1 · premium large-scale
$12.00 / chip-hr
8-chip Trillium ≈ $21.60/hr vs $41.60/hr for 8×H100
Google Cloud GPU VMs · per VM-hour (8 GPUs), calendar mode
A3 High 8× H100
calendar mode
$41.60 / VM-hr
A3 Mega 8× H100
calendar mode
$44.00 / VM-hr
A3 Ultra 8× H200
calendar mode
$59.36 / VM-hr
A4 8× B200
calendar mode · substantial premium
$90.22 / VM-hr
B200 premium pays off only when Blackwell-class throughput is actually consumed
Illustrative Annual Cloud Cost Scenarios
Computed directly from official public hourly prices. Excludes storage, networking, and reservation discounts. Directional only.
| Scenario | TPU v5e | TPU v6e | TPU v5p | Ironwood | 8×H100 | 8×H200 | 8×B200 |
|---|---|---|---|---|---|---|---|
| 8 accelerators · 1 year | $84k | $189k | $294k | $841k | $364k | $520k | $790k |
| 256 accelerators · 6 weeks | $310k | $697k | $1.08M | $3.10M | $1.34M | $1.91M | $2.91M |
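The scenario figures follow directly from the hourly prices above. The sketch below shows the arithmetic under the stated assumptions (8,760 hours per year, 1,008 hours for six weeks, one VM per 8 GPUs, no discounts); it reproduces a few of the table's cells rather than adding new data.

```python
# Reproduce the illustrative scenarios from the on-demand / calendar-mode
# hourly prices listed above. GPU prices are per 8-GPU VM, so 256 GPUs map
# to 32 VMs. Excludes storage, networking, and reservation discounts.
HOURS_PER_YEAR = 24 * 365          # 8,760
HOURS_SIX_WEEKS = 24 * 7 * 6       # 1,008

def tpu_cost(chip_hr, chips, hours):
    return chip_hr * chips * hours

def gpu_cost(vm_hr, gpus, hours):
    return vm_hr * (gpus // 8) * hours   # one VM per 8 GPUs

# Scenario: 8 accelerators for one year
print(f"TPU v6e, 8 chips, 1 year:   ${tpu_cost(2.70, 8, HOURS_PER_YEAR):,.0f}")    # ~$189k
print(f"8xH100 VM, 1 year:          ${gpu_cost(41.60, 8, HOURS_PER_YEAR):,.0f}")   # ~$364k

# Scenario: 256 accelerators for six weeks
print(f"TPU v5p, 256 chips, 6 wks:  ${tpu_cost(4.20, 256, HOURS_SIX_WEEKS):,.0f}")   # ~$1.08M
print(f"8xB200 x 32 VMs, 6 wks:     ${gpu_cost(90.22, 256, HOURS_SIX_WEEKS):,.0f}")  # ~$2.91M
```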
Workload Recommendations
A clear path from raw workload to the right hardware, with no guesswork.
This matrix weights operational fit almost as heavily as silicon capability. In real projects, migration risk and team expertise determine total cost and calendar time as much as FLOP counts do.
- Pod-scale ICI/OCS eliminates communication bottlenecks at scale
- Up to 1M+ chips in one cluster via JAX + Pathways
- Requires willingness to standardize on JAX/XLA
- Trillium: ~$21.60/hr for 8 chips vs $41.60 for 8×H100
- H100: universal enterprise baseline, easiest portability
- Choose based on the team's framework comfort level
- 192 GB (MI300X) and 256 GB (MI325X): memory-capacity leaders
- Valuable for large models, adapters, and memory-bound inference
- Requires more ecosystem qualification than CUDA
- Deepest production inference tooling: TensorRT, vLLM, Triton
- H200: 1.9× Llama 2 70B inference vs H100, 141 GB memory
- Watch TPU 8i, purpose-built for this, as it matures
- Broadest framework, library, and custom-kernel support
- Most predictable developer experience for PyTorch teams
- Lowest migration risk for mixed research/production estates
- SparseCore on v5p, v6e, Ironwood, and TPU 8t is purpose-built for sparse embeddings
- Google-specific system co-design is most valuable here
- XLA/JAX integration maximizes SparseCore efficiency
Open Questions & Limitations
Validate on your actual workload before committing capital.
- No exact workload specified: all recommendations are workload-category guidance, not workload-specific benchmarks.
- TPU 8t / 8i are newly announced (April 22, 2026) and not yet covered by mature cross-vendor benchmarks.
- MLPerf is not fully symmetric: no single benchmark yet covers all current TPU, Hopper, Blackwell, and AMD CDNA configurations in the same scenario.
- Public hardware prices are opaque: buyers procure through negotiated channels, so cloud prices are more reliable anchors than street-price speculation.
- The strongest TPU inference evidence is vendor-run: valuable Google-run benchmarks, but not equivalent to vendor-neutral MLPerf submissions.
Primary Sources
All data comes from official vendor documentation, datasheets, and academic papers. No secondary estimates are invented.