Ch 1 — What Is the Open Source AI Ecosystem & Why It Matters

The closed vs. open spectrum, the 4-layer stack, and why open source is reshaping AI
[Interactive diagram: the stack journey — Models → Formats → Runtimes → Tooling → Apps]
The Problem with Closed AI APIs
Cost, privacy, and vendor lock-in
Per-Token Costs Add Up
Commercial API pricing can become material at high request volume. Open-source deployments shift the cost model toward infrastructure you control and optimize directly.
Vendor Lock-In
Every API call goes through a third party. If they raise prices, deprecate a model, or suffer an outage, your product feels it immediately. Open-source models give you a stable, self-hosted foundation.
Data Privacy
Sending customer data to an external API raises compliance issues (GDPR, HIPAA, SOC 2). Many enterprises cannot send sensitive data to third-party AI APIs. Open source lets you run entirely within your own infrastructure.
Open source doesn't mean free of cost. You still pay for compute. But you own the infrastructure, control the model, and the marginal cost per token approaches zero at scale.
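The tradeoff above is easy to sketch as back-of-envelope arithmetic. All prices below are hypothetical placeholders, not real provider rates — substitute your own numbers:

```python
# Hypothetical cost comparison: metered API vs. self-hosted inference.
# API cost scales linearly with tokens; self-hosted cost is roughly
# flat up to the hardware's throughput ceiling.

def api_monthly_cost(tokens_per_month: float, price_per_million: float) -> float:
    """Metered API: cost grows with every token."""
    return tokens_per_month / 1_000_000 * price_per_million

def self_hosted_monthly_cost(gpu_hours: float, price_per_gpu_hour: float) -> float:
    """Self-hosted: you pay for compute time, not tokens."""
    return gpu_hours * price_per_gpu_hour

# Example: 2 billion tokens/month at a hypothetical $1 per million tokens
api = api_monthly_cost(2_000_000_000, 1.0)        # $2,000/month, and rising with volume
# vs. one GPU running all month at a hypothetical $1.50/hour
hosted = self_hosted_monthly_cost(24 * 30, 1.50)  # $1,080/month, flat
```

The crossover point depends entirely on your real volumes and rates; the structural point is that one curve slopes upward with usage and the other does not.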
The Open Source AI Timeline
From BERT to Llama 4 — the acceleration
The Turning Point: LLaMA
Open-weight model releases accelerated community experimentation and adapter-based customization. The ecosystem moved from a few flagship models to many task-specific variants in a short period.
The Explosion
Recent model generations expanded capability, context handling, and open deployment options across multiple model families. This widened practical use from hobby demos to production copilots, retrieval systems, and domain-specific assistants.
Why It Accelerated
Chinchilla scaling laws (2022) showed models were undertrained. Techniques like LoRA (2021) made fine-tuning cheap. Quantization (GGML/GGUF) made large models run on laptops. Each breakthrough made the next one easier.
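The quantization point can be made concrete with simple arithmetic: weight memory is just parameter count times bits per weight. This sketch ignores KV cache, activations, and per-block quantization overhead, so real footprints run somewhat higher:

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone.
    Ignores KV cache, activations, and quantization block overhead."""
    return n_params * bits_per_weight / 8 / 1e9

seven_b = 7e9
print(weight_memory_gb(seven_b, 16))  # fp16: ~14 GB -- needs a data-center GPU
print(weight_memory_gb(seven_b, 4))   # Q4:   ~3.5 GB -- fits a laptop
```

This 4x reduction is why Q4 quantization turned 7B-class models from cloud-only workloads into something that runs on consumer hardware.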
The ecosystem is maturing. Open-weight models now cover many practical workloads that previously required closed APIs.
The 4-Layer Open Source AI Stack
Models → Formats → Runtimes → Tooling
Layer 1: Models
The model weights encode learned behavior and are published with model cards and metadata on distribution hubs such as Hugging Face. Those cards describe license terms, intended use, and known limitations that teams should review before adoption.
Layer 2: Formats
How weights are stored and transported. SafeTensors: standard for training and cloud GPU inference. GGUF: optimized for CPU and local inference, includes quantization metadata. PyTorch .bin: legacy format, being replaced by SafeTensors.
Layer 3: Runtimes
What actually loads and runs the model weights. llama.cpp: runs GGUF on any hardware. Ollama: wraps llama.cpp with a one-command interface. vLLM: high-throughput production serving. TensorRT-LLM: NVIDIA-optimized inference.
Layer 4: Tooling
Application-layer frameworks abstract model calls and orchestration. LangChain, LlamaIndex, and DSPy each provide different strengths for workflows, retrieval, and programmatic optimization.
Model Formats: SafeTensors vs GGUF
Choosing the right format for your use case
SafeTensors (.safetensors)
Developed by Hugging Face. Safe (no arbitrary code execution unlike pickle-based .pt files). Zero-copy memory mapping for fast loading. The standard for storing and sharing full-precision model weights. Used for training, cloud serving, and fine-tuning.
PyTorch .bin (legacy)
The original format — uses Python pickle serialization. Security risk: malicious .bin files can execute arbitrary code. Being phased out in favor of SafeTensors. Most models on Hugging Face now offer SafeTensors versions.
GGUF
GPT-Generated Unified Format (created by llama.cpp's ggerganov). Single file containing quantized weights + model metadata + tokenizer. Optimized for CPU inference. Supports Q4, Q5, Q8 quantization levels. The standard for local AI with Ollama and llama.cpp.
Rule of thumb: Use SafeTensors for training, cloud GPUs, and fine-tuning. Use GGUF for local inference, Ollama, and llama.cpp on consumer hardware. Conversion tools exist in both directions.
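Both formats are easy to recognize on disk: GGUF files start with the 4-byte magic b"GGUF", and safetensors files start with an unsigned 64-bit little-endian length followed by a JSON header of tensor names, dtypes, shapes, and offsets. A minimal sketch of header-based detection (illustrative, not how any loader actually dispatches):

```python
import json
import struct

def sniff_format(path: str) -> str:
    """Identify a weights file by its header, without loading tensors."""
    with open(path, "rb") as f:
        head = f.read(8)
        if head[:4] == b"GGUF":      # GGUF magic bytes
            return "gguf"
        if len(head) < 8:
            return "unknown"
        # safetensors: u64 little-endian header length, then JSON
        (header_len,) = struct.unpack("<Q", head)
        try:
            header = json.loads(f.read(header_len))
            if isinstance(header, dict):
                return "safetensors"
        except (ValueError, UnicodeDecodeError):
            pass
    return "unknown"
```

The self-describing headers are part of why these formats displaced pickle-based .bin: you can inspect what a file contains without executing anything.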
Runtimes: From Laptop to Data Center
llama.cpp, Ollama, vLLM, and TensorRT-LLM
Consumer Hardware (Laptop / Desktop)
Ollama: one-command ollama run llama3.2 — downloads, caches, and serves prebuilt model variants. Built on llama.cpp. LM Studio: GUI wrapper around llama.cpp with a model browser. llama.cpp directly: for custom configurations and integrations.
Developer Workstation (1-2 GPUs)
Ollama is common for local/developer iteration, while vLLM is commonly used when higher serving throughput is required. This split lets teams prototype quickly, then benchmark serving behavior before promoting workloads.
Production (Multi-GPU Cloud)
vLLM, TensorRT-LLM, and SGLang are widely used production inference options with different performance and integration tradeoffs. Engine selection usually depends on hardware stack, latency targets, and operational tooling preferences.
The beauty of open source: the same model family can often be deployed across Ollama, vLLM, and TensorRT-LLM with format and compatibility checks. The model is yours — you choose the runtime.
Tooling: Application Frameworks
LangChain, LlamaIndex, DSPy — when to use each
LangChain
A widely used AI application framework that provides chains, agents, memory patterns, and tool integrations with observability support. Best for: multi-step workflows, agent orchestration, quick prototyping.
LlamaIndex
RAG-first framework with deep retrieval tooling. LlamaParse for complex document parsing, dozens of vector store integrations, advanced strategies (CRAG, Self-RAG, HyDE, RAPTOR). Best for: RAG applications, document intelligence, agentic workflows over structured data.
DSPy
Stanford's framework that treats LLM programs as optimizable code. Instead of writing prompts by hand, you write modules that DSPy compiles into optimized prompts and few-shot examples. Best for: production-grade pipelines where prompt quality matters systematically.
You don't have to choose one. A common pattern: use LlamaIndex for ingestion and retrieval, LangChain for orchestration and tool use, and DSPy for the final optimization pass. They interoperate.
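The mix-and-match pattern works because each stage is just a function boundary. A framework-free sketch of retrieve-then-generate with swappable stages — none of these names are LangChain, LlamaIndex, or DSPy APIs, they are illustrative stand-ins:

```python
from typing import Callable, List

def make_rag_pipeline(
    retrieve: Callable[[str], List[str]],  # e.g. backed by a LlamaIndex index
    generate: Callable[[str], str],        # e.g. an LLM call wired via LangChain
) -> Callable[[str], str]:
    """Compose retrieval and generation. Because each stage is an
    ordinary callable, any framework can supply either half."""
    def pipeline(question: str) -> str:
        context = "\n".join(retrieve(question))
        return generate(f"Context:\n{context}\n\nQuestion: {question}")
    return pipeline

# Toy stand-ins for the real components:
docs = {"gguf": "GGUF is a single-file quantized model format."}
rag = make_rag_pipeline(
    retrieve=lambda q: [v for k, v in docs.items() if k in q.lower()],
    generate=lambda prompt: prompt.splitlines()[-1],  # echo stand-in for an LLM
)
print(rag("What is GGUF?"))
```

A DSPy-style optimization pass would then tune the prompt template inside `pipeline` rather than the composition itself — the boundaries stay put while each stage improves independently.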
Why This Course — What You'll Build
The 5 sections and where they take you
The 5 Sections
A. Foundation: HF Hub, libraries, open models — the lay of the land
B. Local AI: GGUF, Ollama, llama.cpp, LM Studio — run AI on your hardware
C. Fine-Tuning: LoRA, QLoRA, Unsloth, Axolotl — adapt models to your data
D. Production: vLLM, TGI, SGLang — serve thousands of requests per second
E. Applications: LangChain, LlamaIndex, DSPy, end-to-end stack
The Goal
By the end of this course, you'll understand every layer of the open source AI stack — from how models are stored and quantized, to running them locally, fine-tuning them on your data, serving them in production, and building applications on top of them.
No vendor lock-in. No API keys required. Everything in this course can be run entirely on your own hardware, with free and open-source tools.