LLM Vol.3 :: Reasoning Models, Inference Scaling & Architecture

Chain-of-Thought

Prompting or training models to produce explicit intermediate reasoning steps before a final answer. Unlocks performance on multi-step math, logic, and code :: especially powerful at scale.

Test-Time Compute

Allocating more compute at inference :: not training :: to generate better answers. Models "think longer" by sampling multiple reasoning paths and selecting the best via majority vote or reward models.
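
The majority-vote selection mentioned above can be sketched in a few lines. This is a minimal illustration, assuming the final answers have already been extracted from each sampled reasoning path; the `sampled` list is made-up data, not output from any real model.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer among sampled reasoning paths.

    Ties go to the answer that was seen first.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    return counts.most_common(1)[0][0]

# Hypothetical final answers from five sampled chains of thought.
sampled = ["42", "41", "42", "42", "17"]
print(majority_vote(sampled))
```

A reward model would replace the vote with a score per candidate; the vote needs no extra model, which is why it is a common baseline.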

Mixture of Experts

MoE routes each token to a sparse subset of "expert" sub-networks. Models with trillions of total parameters activate only billions per token :: frontier scale at a fraction of the inference cost.

Scaling Laws

Empirical relationships between model size, training data, compute budget, and capability. Chinchilla scaling revealed that most 2020-era models were undertrained :: for compute-optimal training, the number of training tokens should grow roughly in proportion to the number of parameters.
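
A back-of-the-envelope version of the Chinchilla result, as a sketch rather than a fitted law. It assumes two commonly quoted rules of thumb that are not stated in the text above: about 20 training tokens per parameter, and training compute of roughly 6 FLOPs per parameter per token.

```python
def chinchilla_estimate(params, tokens_per_param=20.0):
    """Estimate compute-optimal training tokens and training FLOPs.

    Both constants are rules of thumb (assumptions), not exact values.
    """
    tokens = tokens_per_param * params
    flops = 6.0 * params * tokens  # ~6 FLOPs per parameter per token
    return tokens, flops

tokens, flops = chinchilla_estimate(70e9)  # a 70B-parameter model
print(f"{tokens:.2e} tokens, {flops:.2e} training FLOPs")
```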

Model Distillation

Training a smaller "student" model to mimic the outputs of a larger "teacher." DeepSeek-R1 distilled reasoning capability into 7B and 14B models that outperform much larger base models on benchmarks.
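
One classic way to make a student mimic a teacher is to match softened output distributions. The sketch below shows that logit-matching loss on toy numbers; the logits and the temperature are illustrative assumptions. Note that DeepSeek-R1's distilled models are reported to have been produced differently, by fine-tuning on teacher-generated reasoning samples, so this illustrates the general idea rather than that recipe.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # The T^2 factor keeps gradient scale comparable across temperatures.
    return kl * temperature ** 2

# Toy logits over a three-token vocabulary (made-up values).
print(distillation_loss([2.0, 1.0, 0.1], [1.5, 1.2, 0.3]))
```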

RLVR Training

Reinforcement Learning with Verifiable Rewards :: training LLMs on tasks with ground-truth answers (math, code) removes the need for human preference labels and dramatically improves reasoning reliability.
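
What makes a reward "verifiable" is that a program, not a human rater, decides it. A minimal sketch follows, assuming a hypothetical convention in which the model ends its output with `Answer: <value>`; real pipelines use more robust answer parsing or unit tests for code.

```python
def verifiable_reward(model_output, reference_answer):
    """Return 1.0 if the stated final answer matches the reference, else 0.0."""
    marker = "Answer:"  # hypothetical output convention, an assumption
    if marker not in model_output:
        return 0.0  # no parseable answer earns no reward
    final = model_output.rsplit(marker, 1)[1].strip()
    return 1.0 if final == reference_answer.strip() else 0.0

print(verifiable_reward("2 + 2 is two pairs. Answer: 4", "4"))
print(verifiable_reward("I think it is five. Answer: 5", "4"))
```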

Speculative Decoding

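A small draft model proposes several candidate tokens; the larger target model verifies them all in one forward pass and keeps the longest accepted prefix. Reported speedups are typically 2–4×, with output identical to what the target model alone would produce under greedy decoding (and the same output distribution under sampling).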
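
A toy sketch of the greedy case shows why the output does not change. The two "models" here are hypothetical stand-in functions over integer tokens, and verification is written as a loop for clarity; a real system scores all drafted positions in one batched forward pass.

```python
def target_next(seq):
    """Stand-in for the large model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10

def draft_next(seq):
    """Stand-in for the draft model: agrees, except it errs after a 4."""
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10

def greedy_decode(next_token, prompt, num_new):
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(next_token(tokens))
    return tokens

def speculative_decode(target, draft, prompt, num_new, k=4):
    tokens = list(prompt)
    goal = len(prompt) + num_new
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each proposed position.
        for i, tok in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if tok != expected:
                # Keep the accepted prefix plus the target's correction.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            # Everything accepted: the target adds one bonus token.
            tokens.extend(proposal)
            tokens.append(target(tokens))
    return tokens[:goal]

print(speculative_decode(target_next, draft_next, [1], 12))
print(greedy_decode(target_next, [1], 12))
```

Every token that survives is one the target model would have chosen itself, so the draft model affects only speed, never content.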

KV Cache & Efficiency

Key-value caching stores the attention keys and values of earlier tokens so they are not recomputed at every decoding step. At 200K context, this is the difference between practical and impossible :: enabling long-document analysis and multi-turn agents.
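
The cache trades memory for compute, and the memory side is easy to estimate. The sketch below uses a hypothetical model configuration (32 layers, 8 key-value heads, head dimension 128, 16-bit values); these numbers are assumptions for illustration and do not describe any specific model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Memory for cached keys and values of one sequence."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * seq_len

# Hypothetical configuration at a 200K-token context.
total = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=200_000)
print(f"{total / 2**30:.1f} GiB for one 200K-token sequence")
```

This is why techniques that shrink the number of key-value heads or the bytes per value matter so much for long contexts.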

Deep Dive

The Reasoning Revolution

OpenAI's o1 demonstrated that spending more compute at inference time :: generating longer chains of thought :: could unlock qualitative capability jumps on tasks that stumped previous models: graduate-level math, competitive coding, and scientific reasoning.

OpenAI reported that o1 ranked in the 89th percentile on Codeforces competitive programming problems, while GPT-4o ranked in the 11th percentile :: roughly an 8× higher percentile rank, credited to reasoning rather than additional pretraining.

DeepSeek-R1 showed that reinforcement learning with verifiable rewards (RLVR) :: training on problems with objectively correct answers :: can instill sophisticated reasoning without expensive human preference data. Its 7B distilled variant is reported to outperform GPT-4o on math benchmarks.

Architecture

Why Mixture of Experts

In a dense transformer, per-token compute grows linearly with parameter count. Mixture-of-Experts decouples the two: a routing layer selects a few specialist sub-networks per token (typically 2–8), leaving the rest dormant. GPT-4 is widely reported, though not confirmed by OpenAI, to use MoE with ~8 experts, activating ~2 per token.
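
The routing step can be sketched with scalars standing in for vectors. This is a minimal illustration of top-k gating under assumed router scores; the four "experts" are hypothetical functions, whereas real experts are feed-forward sub-networks and the router is a learned layer.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts; unselected experts never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Four hypothetical experts that scale their input by 1, 2, 3 and 4.
experts = [lambda x, c=c: x * c for c in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 2.0]  # made-up router scores for one token

print(top_k_route(router_logits))
print(moe_layer(1.0, experts, router_logits))
```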

With MoE, a model can have 1 trillion total parameters but activate only ~100B per forward pass :: matching the inference cost of a much smaller dense model while retaining the capacity of a much larger one.

The trade-offs: MoE models require more memory to store all expert weights, have higher communication overhead in distributed setups, and can suffer from load-imbalance where some experts are consistently over-selected. Auxiliary loss terms during training encourage balanced routing.
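
One widely used form of the auxiliary balancing term multiplies, per expert, the fraction of tokens routed to it by the average router probability it receives, then sums and scales by the number of experts. The sketch below follows that form on toy data; the assignments and probabilities are made up, and the loss weight applied in real training is omitted.

```python
def load_balance_loss(assignments, router_probs, num_experts):
    """Auxiliary loss that is smallest when routing is uniform.

    assignments: chosen expert index for each token.
    router_probs: per-token list of router probabilities over experts.
    """
    n_tokens = len(assignments)
    # Fraction of tokens dispatched to each expert.
    f = [assignments.count(e) / n_tokens for e in range(num_experts)]
    # Mean router probability assigned to each expert.
    p = [sum(row[e] for row in router_probs) / n_tokens
         for e in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

balanced = load_balance_loss([0, 1, 0, 1], [[0.5, 0.5]] * 4, 2)
skewed = load_balance_loss([0, 0, 0, 0], [[0.9, 0.1]] * 4, 2)
print(balanced, skewed)
```

The skewed case scores higher, so adding this term to the training loss nudges the router away from over-selecting one expert.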

Key challenge: MoE models can underperform dense models of comparable pretraining quality when transferred to new tasks, particularly under fine-tuning, as experts specialize during pretraining and may not generalize as flexibly.

Comparison

Reasoning Model Landscape

Model	Training Approach	Reasoning Method	Parameters	Strength
OpenAI o1 / o3	Large-scale RL on chain-of-thought (details undisclosed)	Internal chain-of-thought (hidden)	Undisclosed	Math & Science
DeepSeek-R1	RLVR on verifiable tasks	Explicit long CoT, visible to user	671B MoE (37B active)	Cost Efficiency
DeepSeek-R1 Distilled	Knowledge distillation from R1	CoT inherited from teacher	1.5B to 70B dense	Small-scale SOTA
QwQ-32B	RLVR + self-improvement	Extended reasoning traces	32B dense	Open weights
Claude (Extended Thinking)	Constitutional AI + RLHF	Visible scratchpad thinking	Undisclosed	Safety + Reasoning
Gemini Thinking	Multimodal RLHF	Internal multi-hypothesis reasoning	Undisclosed	Multimodal

The Reasoning Milestones

From chain-of-thought prompting to test-time compute scaling :: the key moments that defined the reasoning revolution.

2022

Chain-of-Thought Prompting

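Wei et al. showed that few-shot prompts containing worked reasoning steps dramatically improved LLM performance on multi-step reasoning tasks :: no fine-tuning required. Kojima et al. then found that simply appending "Let's think step by step" produced a similar effect with no examples at all.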

2023

Process Reward Models

OpenAI trained reward models that score intermediate reasoning steps, not just final answers :: enabling step-level feedback for training and more reliable selection among candidate solutions.
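
Step-level scores still have to be combined into one score per solution. A minimal sketch of two common aggregation choices, product and minimum, is shown below; the step scores are invented, and in practice they would come from a trained process reward model.

```python
import math

def solution_score(step_scores, how="product"):
    """Combine per-step correctness scores into one solution score."""
    if how == "product":
        return math.prod(step_scores)  # chance that every step is right
    if how == "min":
        return min(step_scores)        # the weakest step decides
    raise ValueError(f"unknown aggregation: {how}")

# Hypothetical step scores for two candidate solutions.
candidates = {
    "A": [0.9, 0.9, 0.9],     # uniformly solid
    "B": [0.99, 0.99, 0.3],   # strong start, one doubtful step
}
best = max(candidates, key=lambda name: solution_score(candidates[name]))
print(best)
```

An outcome-only reward model would see just the final answers and could not tell that candidate B contains a weak step.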

2024 · Sept

OpenAI o1 Launch

o1 demonstrated that scaling inference compute (generating longer CoT) could match or exceed much larger dense models on math and coding :: the "inference scaling law" era begins.

2025 · Jan

DeepSeek-R1 & RLVR

Open-weight reasoning model trained with verifiable rewards rather than human preference data. Its 7B distilled variant is reported to beat GPT-4o on math benchmarks such as MATH-500, democratizing frontier reasoning.

2025

Inference Scaling Laws

Research confirmed that more inference compute (best-of-N sampling, MCTS, self-refinement) trades off predictably against accuracy :: giving practitioners a new performance lever.
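
The simplest case of this trade-off has a closed form. The sketch below assumes each sample is independently correct with probability p and that a perfect verifier picks a correct sample whenever one exists; both are idealizations, so real curves flatten earlier.

```python
def best_of_n_success(p, n):
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n

# Assumed single-sample accuracy of 20%.
for n in (1, 4, 16, 64):
    print(n, round(best_of_n_success(0.2, n), 4))
```

Each doubling of samples buys less than the previous one, which is why routing easy questions to small budgets and hard ones to large budgets pays off.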

2025–2026

Reasoning as Default

Every frontier model :: GPT-5, Claude Opus 4.6, Gemini 3, Grok 4 :: now ships with reasoning/thinking modes. The debate shifts from "can LLMs reason?" to "how to route efficiently."