Reasoning models generate intermediate thinking steps before producing a final answer, which dramatically improves performance on complex tasks like math, coding, and multi-step logic.
The frontier of LLM research: how reasoning models like o1 and DeepSeek-R1 use test-time compute to solve harder problems, why mixture-of-experts unlocks massive scale at manageable cost, and what scaling laws tell us about where AI is heading.
Prompting or training models to produce explicit intermediate reasoning steps before a final answer. Unlocks performance on multi-step math, logic, and code, and is especially powerful at scale.
Allocating more compute at inference, rather than training, to generate better answers. Models "think longer" by producing longer reasoning chains or by sampling multiple reasoning paths and selecting the best via majority vote or reward models.
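The majority-vote idea can be sketched in a few lines. This is a minimal illustration, not any particular model's method: it assumes the final answers have already been extracted from several sampled reasoning paths, and the `sampled` list is made-up data.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common final answer and the share of samples that agree."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Hypothetical final answers extracted from five sampled reasoning paths.
sampled = ["42", "42", "41", "42", "40"]
print(majority_vote(sampled))  # ('42', 0.6)
```

The agreement share doubles as a rough confidence signal: a low share suggests the samples disagree and more paths (or a reward model) may be needed.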
Mixture-of-experts (MoE) routes each token to a sparse subset of "expert" sub-networks. Models with trillions of total parameters activate only billions per token, giving frontier scale at a fraction of the per-token inference compute.
Empirical relationships between model size, training data, compute budget, and capability. The Chinchilla study showed that most 2020-era models were undertrained: for compute-optimal training, the amount of data should grow in proportion to the parameter count.
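The Chinchilla result is often compressed into two rules of thumb: training compute is roughly 6 FLOPs per parameter per token, and a compute-optimal run uses roughly 20 tokens per parameter. Both constants are approximations assumed here for illustration, not exact values from the text above. Under those assumptions, a compute budget fixes both model size and data size:

```python
import math

TOKENS_PER_PARAM = 20.0        # assumed rule-of-thumb ratio D ≈ 20 * N
FLOPS_PER_PARAM_TOKEN = 6.0    # assumed approximation C ≈ 6 * N * D


def compute_optimal(flops_budget):
    """Return (params, tokens) that spend flops_budget under the two rules above."""
    # C = 6 * N * D and D = 20 * N  =>  C = 120 * N**2
    params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


def tokens_per_param(params, tokens):
    """Ratio to compare against the ~20 tokens/parameter rule of thumb."""
    return tokens / params


params, tokens = compute_optimal(5.88e23)
print(f"{params / 1e9:.0f}B params, {tokens / 1e12:.1f}T tokens")
```

For comparison, GPT-3's reported 175B parameters and roughly 300B training tokens give `tokens_per_param(175e9, 300e9)` below 2, which is the sense in which such models were "undertrained".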
Training a smaller "student" model to mimic the outputs of a larger "teacher." DeepSeek-R1 distilled reasoning capability into 7B and 14B models that outperform much larger base models on benchmarks.
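As a rough sketch of what "mimic the outputs" can mean, the classic soft-label formulation penalizes the divergence between temperature-softened teacher and student distributions. This is the textbook logit-matching variant, shown here as an assumption for illustration; the DeepSeek-R1 distilled models were reportedly produced differently, by fine-tuning smaller models on teacher-generated samples. The logits and temperature below are made up.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature**2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [4.0, 1.0, 0.5]          # hypothetical next-token logits
close_student = [3.8, 1.1, 0.4]
far_student = [0.5, 4.0, 1.0]
print(distillation_loss(teacher, close_student) < distillation_loss(teacher, far_student))  # True
```

A higher temperature flattens both distributions, so the student also learns from the teacher's relative preferences among less likely tokens.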
Reinforcement Learning with Verifiable Rewards (RLVR): training LLMs on tasks with ground-truth answers (math, code) removes the need for human preference labels on those tasks and dramatically improves reasoning reliability.
A small draft model cheaply proposes several candidate tokens; a larger target model then verifies all of them in parallel in one forward pass. Typically reported speedups are 2–4×, with output identical to what the target model would have produced on its own.
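The draft-then-verify loop can be sketched with toy models. This is the greedy variant under simplifying assumptions: `toy_target` and `toy_draft` are hypothetical stand-ins that map a token prefix to the next token, and the per-position target calls below would come from a single batched forward pass in a real system.

```python
def greedy_decode(next_token, prompt, max_new_tokens):
    """Baseline: call the model once per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding: keep draft tokens only while the target agrees."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The cheap draft model proposes k tokens, one after another.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target checks every proposed position.
        accepted = []
        for i in range(k):
            expected = target_next(tokens + proposal[:i])
            accepted.append(expected)
            if proposal[i] != expected:
                break  # first mismatch: keep the target's token, drop the rest
        else:
            # All k proposals matched, so the target's next token comes for free.
            accepted.append(target_next(tokens + proposal))
        tokens.extend(accepted)
    return tokens[:goal]


def toy_target(seq):
    """Hypothetical 'large' model: a deterministic function of the prefix."""
    return (7 * sum(seq) + len(seq)) % 11


def toy_draft(seq):
    """Hypothetical 'small' model: agrees with the target except at every third position."""
    guess = toy_target(seq)
    return guess if len(seq) % 3 else (guess + 1) % 11


prompt = [3, 1, 4]
print(speculative_decode(toy_target, toy_draft, prompt, 12) == greedy_decode(toy_target, prompt, 12))  # True
```

Every token that ends up in the output is one the target model itself would have chosen, which is why the result matches plain greedy decoding; the speedup depends entirely on how often the draft guesses right.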
Key-value (KV) caching stores the attention keys and values of earlier tokens so they are not recomputed at every generation step. At 200K context, this is the difference between practical and impossible, enabling long-document analysis and multi-turn agents.
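Some back-of-the-envelope arithmetic shows both sides of the trade. The sketch below estimates cache size and counts how many key/value projections are needed with and without a cache; the model configuration (32 layers, 8 KV heads, head dimension 128, 2-byte values) is a hypothetical example, not any specific model.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Memory needed to hold keys and values for every cached position."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value * batch_size


def kv_projections(num_tokens, use_cache):
    """Per-layer key/value projections computed while generating token by token."""
    if use_cache:
        return num_tokens  # each position is projected exactly once
    # Without a cache, step t re-projects all t positions seen so far.
    return num_tokens * (num_tokens + 1) // 2


size = kv_cache_bytes(seq_len=200_000, num_layers=32, num_kv_heads=8, head_dim=128)
print(f"{size / 1e9:.1f} GB")  # 26.2 GB
print(kv_projections(200_000, use_cache=False) // kv_projections(200_000, use_cache=True))  # 100000
```

Under these assumptions the cache costs tens of gigabytes of memory but removes roughly a 100,000-fold redundancy in key/value computation at 200K tokens.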
OpenAI's o1 demonstrated that spending more compute at inference time, by generating longer chains of thought, could unlock qualitative capability jumps on tasks that stumped previous models: graduate-level math, competitive coding, and scientific reasoning.
DeepSeek-R1 showed that reinforcement learning with verifiable rewards (RLVR), that is, training on problems with objectively correct answers, can instill sophisticated reasoning with little reliance on expensive human preference data. Its 7B distilled variant is reported to outperform GPT-4o on math benchmarks such as AIME 2024 and MATH-500.
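A verifiable reward is, at its simplest, a program that checks the final answer. The sketch below assumes the model is asked to put its answer inside `\boxed{...}` and that answers are plain numbers or fractions; both the format and the function names are illustrative assumptions, not DeepSeek's actual reward code.

```python
import re
from fractions import Fraction


def extract_final_answer(completion):
    # Return the text inside the last \boxed{...} in the completion, or None.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None


def verifiable_reward(completion, ground_truth):
    """1.0 if the boxed answer equals the ground truth, else 0.0. No human label needed."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0  # unparseable output earns nothing
    try:
        # Compare numerically so that "0.5" and "1/2" count as the same answer.
        return 1.0 if Fraction(answer) == Fraction(ground_truth) else 0.0
    except (ValueError, ZeroDivisionError):
        # Non-numeric answers fall back to exact string comparison.
        return 1.0 if answer == ground_truth.strip() else 0.0


print(verifiable_reward(r"Half of one is \boxed{0.5}", "1/2"))  # 1.0
print(verifiable_reward("I think the answer is 0.5", "1/2"))    # 0.0
```

For code tasks the same role is typically played by running unit tests against the generated program.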
In a dense transformer, per-token compute grows linearly with parameter count. Mixture-of-Experts decouples the two: a routing layer selects a few specialist sub-networks per token (typically 2–8), leaving the rest dormant. GPT-4 is widely reported, though not officially confirmed, to use MoE with ~8 experts, activating ~2 per token.
The trade-offs: MoE models need more memory than a dense model with the same active parameter count, because all expert weights must be stored; they have higher communication overhead in distributed setups; and they can suffer from load imbalance, where some experts are consistently over-selected. Auxiliary loss terms during training encourage balanced routing.
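Routing and the balancing penalty can both be sketched in plain Python. The router below picks the top-k experts by softmax probability, and the auxiliary loss follows the Switch-Transformer-style form `num_experts * sum(f_i * P_i)`; normalizing the dispatch fraction by `top_k` is an assumption made here so the loss equals 1.0 under perfectly uniform routing. All logits are made-up values.

```python
import math


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=2):
    """Pick the top_k experts for one token and renormalize their gate weights."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in chosen)
    return {i: probs[i] / total for i in chosen}


def load_balance_loss(batch_router_logits, top_k=2):
    """Auxiliary loss: grows when routing concentrates on a few experts."""
    num_tokens = len(batch_router_logits)
    num_experts = len(batch_router_logits[0])
    dispatch = [0.0] * num_experts   # f_i: share of routing slots sent to expert i
    mean_prob = [0.0] * num_experts  # P_i: mean router probability for expert i
    for logits in batch_router_logits:
        for i in route_token(logits, top_k):
            dispatch[i] += 1.0 / (num_tokens * top_k)
        for i, p in enumerate(softmax(logits)):
            mean_prob[i] += p / num_tokens
    return num_experts * sum(f * p for f, p in zip(dispatch, mean_prob))


balanced = [[5, 0, 0, 0], [0, 5, 0, 0], [0, 0, 5, 0], [0, 0, 0, 5]]
collapsed = [[5, 0, 0, 0]] * 4
print(load_balance_loss(balanced, top_k=1) < load_balance_loss(collapsed, top_k=1))  # True
```

Only the experts returned by `route_token` run for a given token, which is how a model such as DeepSeek-R1 can hold 671B parameters while activating 37B (about 5.5%) per token.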
| Model | Training Approach | Reasoning Method | Parameters | Strength |
|---|---|---|---|---|
| OpenAI o1 / o3 | Large-scale RL (details undisclosed) | Internal chain-of-thought (hidden) | Undisclosed | Math & Science |
| DeepSeek-R1 | RLVR on verifiable tasks | Explicit long CoT, visible to user | 671B MoE (37B active) | Cost Efficiency |
| DeepSeek-R1 Distilled | Knowledge distillation from R1 | CoT inherited from teacher | 7B / 14B / 32B | Small-scale SOTA |
| QwQ-32B | RLVR + self-improvement | Extended reasoning traces | 32B dense | Open weights |
| Claude (Extended Thinking) | Constitutional AI + RLHF | Visible scratchpad thinking | Undisclosed | Safety + Reasoning |
| Gemini Thinking | Multimodal RLHF | Internal multi-hypothesis reasoning | Undisclosed | Multimodal |
From chain-of-thought prompting to test-time compute scaling: the key moments that defined the reasoning revolution.
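Wei et al. showed that prompting with a few worked examples that spell out intermediate reasoning steps dramatically improved LLM performance on multi-step reasoning tasks. Kojima et al. then showed that simply appending "Let's think step by step" helps even without examples. No fine-tuning required.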
OpenAI trained reward models that score intermediate reasoning steps, not just final answers, providing step-level feedback that can be used to rank candidate solutions or to guide reinforcement learning along a solution chain.
o1 demonstrated that scaling inference compute (generating longer CoT) could match or exceed much larger dense models on math and coding, marking the start of the "inference scaling law" era.
Open-weight reasoning model trained primarily with verifiable rewards rather than human preference data. Its 7B distilled variant is reported to beat GPT-4o on the MATH-500 benchmark, democratizing frontier reasoning.
Research confirmed that additional inference compute (best-of-N sampling, MCTS, self-refinement) buys accuracy in a predictable way, giving practitioners a new performance lever.
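Best-of-N is the simplest of these levers to sketch: sample N candidates, score each, keep the top one. Everything below is a toy under stated assumptions: `toy_generate` stands in for sampling one reasoning path, and `toy_score` stands in for a reward model or verifier (here an exact checker, which is more reliable than a learned scorer would be).

```python
import random

TRUE_ANSWER = 408  # 17 * 24, the toy task used below


def best_of_n(generate, score, prompt, n, seed=0):
    """Sample n candidates and return the one the scorer ranks highest."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=score)


def toy_generate(prompt, rng):
    """Hypothetical sampler: correct 30% of the time, otherwise off by some error."""
    return TRUE_ANSWER + rng.choice([0, 0, 0, -17, -10, -1, 1, 2, 10, 24])


def toy_score(answer):
    """Hypothetical scorer: higher is better, perfect at 0."""
    return -abs(answer - TRUE_ANSWER)


def accuracy(n, trials=200):
    """Share of trials in which best-of-n returns the correct answer."""
    hits = sum(
        best_of_n(toy_generate, toy_score, "17 * 24 = ?", n, seed=t) == TRUE_ANSWER
        for t in range(trials)
    )
    return hits / trials


for n in (1, 4, 16):
    print(n, accuracy(n))
```

With a perfect scorer, accuracy can only rise as N grows; with a learned reward model the curve flattens sooner, since the scorer's own mistakes start to dominate.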
Every frontier model (GPT-5, Claude Opus 4.6, Gemini 3, Grok 4) now ships with reasoning/thinking modes. The debate shifts from "can LLMs reason?" to "how to route efficiently."