What Is a Large Language Model?
At its core, a Large Language Model is a neural network trained on massive amounts of text to predict what comes next in a sequence :: yet from this simple objective emerges remarkable intelligence.
LLMs are built on the transformer architecture and trained on billions to trillions of text tokens drawn from the internet, books, code repositories, and scientific literature. The "large" refers to the sheer scale: modern frontier models contain hundreds of billions of learnable parameters :: the adjustable weights that encode linguistic patterns, factual knowledge, and reasoning capabilities simultaneously.
Unlike earlier recurrent neural networks that processed sequences token by token, transformers process entire sequences in parallel using a mechanism called self-attention :: enabling both dramatically faster training on GPUs and richer long-range understanding of context. This architectural leap, combined with scale and sophisticated training techniques, produced models capable of writing code, analyzing legal contracts, passing medical exams, and conversing fluently across dozens of languages.
Transformer Architecture
The "attention is all you need" revolution: how transformers use self-attention to weigh every word against every other word, capturing meaning across vast distances in text.
Every modern LLM is built on the transformer architecture, introduced by Google researchers in 2017. The key innovation is multi-head self-attention: for each token in a sequence, the model learns to attend to all other tokens with varying degrees of relevance :: allowing it to resolve pronoun references, track subject-object relationships, and understand nuance across thousands of tokens of context.
Stacked transformer blocks :: each containing attention layers and feed-forward networks :: build increasingly abstract representations. Early layers capture syntax and surface patterns; deeper layers encode semantics, world knowledge, and complex reasoning patterns. GPT-4, Claude, and Gemini all use decoder-only transformer variants, while some models use the full encoder-decoder design for tasks like translation.
Training: From Text to Intelligence
Three phases transform raw compute and data into a helpful, safe, and capable assistant :: pretraining, supervised fine-tuning, and reinforcement learning from human feedback.
Phase 1 :: Pretraining: The model predicts the next token across trillions of examples. This demands thousands of GPUs running for months and produces a "base model" that understands language deeply but has no particular goal or alignment.
Phase 2 :: Supervised Fine-Tuning (SFT): The base model is fine-tuned on high-quality human-written demonstrations of desired behavior :: transforming it from a raw language predictor into a capable instruction-following assistant.
Phase 3 :: RLHF: Reinforcement Learning from Human Feedback uses human preferences to train a reward model, which then guides the LLM via PPO to produce outputs that humans rate as more helpful, accurate, and harmless :: producing the polished models users interact with today.