Ch 14 — The LLM Landscape

Capstone — the major players, open vs closed, what’s next, and connecting everything you’ve learned
The Major Players
Who builds frontier LLMs and what makes each unique
The Landscape
A handful of organizations build frontier LLMs: OpenAI (GPT-4, o1/o3), Anthropic (Claude 3.5/4), Google DeepMind (Gemini), Meta (Llama), Mistral (Mistral, Mixtral), DeepSeek (DeepSeek-V3, R1), Alibaba (Qwen), xAI (Grok). Each has a different philosophy: OpenAI prioritizes capability, Anthropic prioritizes safety, Meta prioritizes openness, DeepSeek prioritizes efficiency.
Key insight: The LLM landscape changes every few months. What matters is understanding the architecture (Ch 1-4), training (Ch 5-8), and inference (Ch 9-11) fundamentals. These don’t change. A new model from any lab is still: tokenizer → embeddings → transformer blocks → output head, trained with next-token prediction, aligned with RLHF/DPO, served with KV cache + batching. The fundamentals you’ve learned apply to every model.
Model Families
# Frontier models (2024-2025):
#
# OpenAI: GPT-4o, o1, o3
#   → Multimodal, reasoning (test-time compute)
#   → Closed source, API only
# Anthropic: Claude 3.5 Sonnet/Opus
#   → Safety-focused, Constitutional AI
#   → 200K context, strong coding
# Google: Gemini 1.5/2.0
#   → Native multimodal, 1M context
#   → MoE architecture
# Meta: Llama 3/3.1/3.2 (1B-405B)
#   → Open weights, community ecosystem
#   → Most fine-tuned model family
# DeepSeek: V3, R1
#   → MoE (671B total / 37B active), reasoning
#   → Open weights, remarkably efficient
# Mistral: Mistral, Mixtral, Large
#   → European, efficient, open weights
Open Models: The Democratization of AI
Llama, Mistral, DeepSeek — why open matters
Why Open Matters
Open-weight models (Llama, Mistral, DeepSeek, Qwen) have transformed AI. You can download them, fine-tune them (Ch 7), run them locally (Ch 11), and inspect their weights. This enables: privacy (data never leaves your server), customization (fine-tune for your domain), cost control (no per-token API fees), and research (study how models work). The open ecosystem is now competitive with closed models on many tasks.
Key insight: The gap between open and closed models has narrowed dramatically. Llama 3.1 405B rivals GPT-4 on many benchmarks. DeepSeek-R1 matches o1 on math and coding. Qwen 2.5 72B is competitive with Claude 3.5 Sonnet. For most applications, open models are “good enough” — and the ability to fine-tune and self-host makes them the practical choice for many companies.
Open Model Ecosystem
# Open model ecosystem:
#
# Model hubs:
#   HuggingFace: 800K+ models
#   Ollama: one-command local serving
# Fine-tuning tools:
#   HuggingFace PEFT, Unsloth, Axolotl
# Serving frameworks:
#   vLLM, TGI, llama.cpp, SGLang
# Quantization:
#   GGUF (llama.cpp), GPTQ, AWQ
# Evaluation:
#   lm-eval-harness, HELM, Chatbot Arena
#
# The virtuous cycle:
#   Open model → community fine-tunes →
#   better models → more adoption →
#   more investment → better open models
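The quantization formats listed above (GGUF, GPTQ, AWQ) differ in detail, but all build on one core idea: store weights as small integers plus a float scale. A minimal symmetric int8 sketch of that idea in pure Python (illustrative only — real formats quantize per-group and per-channel):

```python
# Minimal symmetric int8 quantization sketch:
# store int8 codes plus one shared float scale.

def quantize_int8(weights):
    """Map floats to int8 codes in [-127, 127] with a shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize_int8(codes, scale):
    """Recover approximate float weights."""
    return [c * scale for c in codes]

weights = [0.12, -0.5, 0.33, 0.07]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
# Each restored weight is within half a quantization step (scale / 2)
# of the original, at a quarter of float32's memory cost.
```

Real quantizers (GPTQ, AWQ) add calibration data and error compensation on top of this round-trip, but the memory-versus-precision trade-off is the same.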
Closed Models: The Frontier Edge
Why GPT-4, Claude, and Gemini still lead on the hardest tasks
The Closed Advantage
Closed models maintain an edge on the hardest tasks: complex reasoning (o1/o3), long-context synthesis (Claude 200K), multimodal understanding (Gemini 1.5). They benefit from: massive compute budgets ($100M+ training), proprietary data, extensive RLHF, and system-level optimizations (tool use, function calling, structured output). For cutting-edge applications, closed APIs are often still the best choice.
Key insight: The practical choice depends on your constraints. Use closed APIs when you need the absolute best quality, don’t have ML expertise, or need features like function calling and structured output. Use open models when you need privacy, cost control, customization, or offline capability. Many production systems use both: open models for high-volume simple tasks, closed APIs for complex reasoning.
Decision Framework
# When to use what:
#
# Closed API (GPT-4, Claude, Gemini):
#   ✓ Best quality on hard tasks
#   ✓ No infrastructure needed
#   ✓ Built-in tools, function calling
#   ✗ Per-token cost, vendor lock-in
#   ✗ Data sent to third party
# Open self-hosted (Llama, Qwen, DeepSeek):
#   ✓ Full control, privacy
#   ✓ Fine-tunable for your domain
#   ✓ Fixed cost (GPU rental)
#   ✗ Requires ML/infra expertise
#   ✗ Slightly lower quality ceiling
# Hybrid (most common in production):
#   Simple tasks → small open model (cheap)
#   Hard tasks → frontier API (quality)
#   Sensitive data → self-hosted (privacy)
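The hybrid pattern can be sketched as a tiny router. Everything here is hypothetical — the model names and the boolean flags are placeholders, not a real API; a production router would classify the request itself:

```python
def route(task: str, sensitive: bool, hard: bool) -> str:
    """Pick a model tier for a request (illustrative heuristic only)."""
    if sensitive:
        return "self-hosted/llama-3.1-70b"  # data never leaves our servers
    if hard:
        return "api/frontier-model"         # pay per token for top quality
    return "self-hosted/llama-3.2-3b"       # cheap bulk tier for simple tasks

tier = route("summarize this support ticket", sensitive=False, hard=False)
```

The ordering encodes the priorities from the framework above: privacy constraints override everything, then quality, then cost.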
AI Agents: LLMs That Take Action
From chatbots to autonomous systems
The Next Frontier
The biggest shift in 2024-2025 is from chatbots (LLMs that answer questions) to agents (LLMs that take actions). An agent can: browse the web, write and execute code, call APIs, manage files, and orchestrate multi-step workflows. The LLM is the “brain” that plans and reasons; tools are the “hands” that interact with the world. This is where all the concepts from this course converge.
Key insight: Agents combine everything: the LLM generates a plan (Ch 9), uses tool calling (function calling) to execute steps, maintains context (Ch 10) across a long workflow, and self-corrects using chain-of-thought reasoning. Frameworks like LangChain, CrewAI, and OpenAI’s Assistants API make this accessible. The challenge: reliability. Agents compound LLM errors across steps, so each step must be robust.
Agent Architecture
# Agent loop (simplified):
while not done:
    # 1. LLM plans next action
    action = llm.generate(context + tools + history)
    # 2. Execute the action
    if action.type == "search":
        result = web_search(action.query)
    elif action.type == "code":
        result = execute_python(action.code)
    elif action.type == "api_call":
        result = call_api(action.endpoint)
    # 3. Add result to context
    history.append((action, result))
    # 4. LLM decides: continue or finish?
    if action.type == "final_answer":
        done = True
# Frameworks: LangChain, CrewAI,
# OpenAI Assistants, Anthropic MCP
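A runnable toy version of this loop, with both the LLM and the tool stubbed out (all names here are illustrative, not a real framework's API):

```python
# Toy agent loop: a scripted "LLM" plans, a stub tool executes,
# and results feed back into history until a final answer appears.

def fake_llm(history):
    """Stand-in planner: search first, then answer from the result."""
    if not history:
        return {"type": "search", "query": "capital of France"}
    last_result = history[-1][1]
    return {"type": "final_answer", "text": f"Answer: {last_result}"}

def web_search(query):
    """Stand-in tool: canned lookup instead of a real search."""
    return {"capital of France": "Paris"}.get(query, "no results")

history = []
done = False
while not done:
    action = fake_llm(history)              # 1. plan next action
    if action["type"] == "search":          # 2. execute the action
        result = web_search(action["query"])
        history.append((action, result))    # 3. add result to context
    elif action["type"] == "final_answer":
        answer = action["text"]
        done = True                         # 4. finish
# answer == "Answer: Paris"
```

Even this toy version shows the reliability problem: if the tool returned garbage at step 2, the planner would confidently build its final answer on it, which is why real agents need validation at every step.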
The Complete Picture: Connecting All 14 Chapters
From raw text to intelligent agent — the full pipeline
The Journey
You now understand the complete LLM pipeline: Text → Tokens (Ch 1) → Embeddings (Ch 2) → Attention (Ch 3) → Transformer Blocks (Ch 4) → Scale (Ch 5) → Pretraining (Ch 6) → Fine-tuning (Ch 7) → Alignment (Ch 8) → Generation (Ch 9) → Context (Ch 10) → Optimization (Ch 11) → Multimodal (Ch 12) → Capabilities & Limits (Ch 13) → The Landscape (Ch 14). Every concept builds on the previous ones.
Key insight: The entire field rests on one elegant idea: predict the next token. Everything else — attention, scaling laws, RLHF, quantization, multimodality — is engineering to make that simple idea work better, faster, and more safely. The transformer architecture from 2017 is still the foundation. The innovation is in training recipes, data, alignment, and serving infrastructure.
The Full Pipeline
# The complete LLM pipeline:
#
# ARCHITECTURE (Ch 1-4):
#   text → tokenize → embed → N × (
#     norm → attention → residual →
#     norm → FFN → residual
#   ) → norm → output head → logits
#
# TRAINING (Ch 5-8):
#   Pretrain (15T tokens, next-token pred)
#   → SFT (100K examples, instruction format)
#   → RLHF/DPO (50K prefs, quality+safety)
#
# INFERENCE (Ch 9-11):
#   Prompt → prefill (parallel) →
#   decode (sequential, KV cache) →
#   sample (temperature + top-p) → output
#   Optimized: quantize + FlashAttn + batch
#
# FRONTIER (Ch 12-14):
#   + Vision (ViT + adapter)
#   + Audio (codec tokens)
#   + Reasoning (CoT, test-time compute)
#   + Agents (tool use, planning)
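The "sample (temperature + top-p)" step of the inference stage fits in a few lines of pure Python. This is a sketch of the standard algorithm over a toy 4-token vocabulary, not any particular library's implementation:

```python
import math
import random

def sample_token(logits, temperature=0.8, top_p=0.9, rng=random):
    """Temperature-scaled softmax followed by nucleus (top-p) sampling."""
    # 1. Temperature: divide logits, then softmax (max-subtracted for stability)
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 2. Top-p: keep the smallest set of tokens whose cumulative mass >= top_p
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # 3. Renormalize over the nucleus and draw one token id
    nucleus_mass = sum(probs[i] for i in kept)
    weights = [probs[i] / nucleus_mass for i in kept]
    return rng.choices(kept, weights=weights)[0]

token = sample_token([2.0, 1.0, 0.1, -1.0])
```

With these logits the two lowest-probability tokens fall outside the 0.9 nucleus, so they are never sampled: top-p trims the long tail that temperature alone would leave in play.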
What’s Next: The Road Ahead
Where LLMs are heading and what to learn next
Emerging Trends
Test-time compute scaling (o1, R1): thinking longer for better answers. Smaller, smarter models: today's 3B models matching 2023's 70B. On-device AI: LLMs running on phones and laptops. AI agents: autonomous multi-step workflows. Multimodal unification: one model for all modalities. Synthetic data: models generating training data for other models. New architectures: Mamba (state-space models) and RWKV (linear attention) explore alternatives to the transformer.
Key insight: You now have the foundation to understand any new development in LLMs. When a new model is announced, you can ask: What’s the architecture? (Ch 4) How was it trained? (Ch 6-8) What’s the context length? (Ch 10) How is it served? (Ch 11) What are its limitations? (Ch 13) This mental framework will serve you well as the field continues to evolve at breakneck speed. The fundamentals don’t change — only the details.
Your Next Steps
# What to explore next:
#
# 1. Hands-on: Run a model locally
#    → Install Ollama, try Llama 3.2 3B
# 2. Fine-tune: Customize a model
#    → QLoRA on your own dataset
# 3. Build: Create an AI application
#    → RAG system with your documents
# 4. Deepen: Read the key papers
#    → "Attention Is All You Need" (2017)
#    → "Language Models are Few-Shot Learners" (GPT-3)
#    → "Training language models to follow instructions" (InstructGPT)
#    → "LLaMA: Open and Efficient Foundation Language Models" (2023)
# 5. Stay current: Follow the field
#    → arxiv.org/list/cs.CL
#    → HuggingFace blog
#    → Chatbot Arena leaderboard
#
# You now understand how LLMs work.
# Go build something amazing. 🚀