Ch 6 — Text-to-Image Generation

Stable Diffusion, DALL-E 3, Midjourney, Flux — ControlNet, inpainting, and the creative workflow
High-Level Pipeline: Prompt → Encode → Diffuse → Control → Output → Create
The Text-to-Image Landscape
Major models and their architectures
Model Comparison
// Major text-to-image models (2025)
Stable Diffusion XL: open-source, U-Net denoiser; text: dual CLIP + OpenCLIP; res: 1024×1024; large community ecosystem
Flux (Black Forest): open-weight, DiT; text: T5-XXL + CLIP, Flow Matching; res: up to 2048×2048; state-of-the-art open model
DALL-E 3 (OpenAI): closed, proprietary; GPT-4 rewrites prompts internally; res: 1024×1024; strong text rendering
Midjourney v6: closed, proprietary; aesthetic focus, Discord-based; res: up to 2048×2048; best aesthetics
Imagen 3 (Google): closed, proprietary; text: T5-XXL encoder; res: 1024×1024; strong photorealism
Open vs Closed
Open-source (SD, Flux): Run locally, full control, fine-tunable, community LoRAs and extensions, no content restrictions, no API costs
Closed (DALL-E, Midjourney): Higher quality out-of-box, no GPU needed, content safety filters, pay-per-image, no customization

The gap is closing rapidly. Flux rivals DALL-E 3 quality while being open-weight.
Key insight: The text-to-image landscape is bifurcating: closed models optimize for safety and ease-of-use (DALL-E, Midjourney), while open models optimize for control and customization (SD, Flux). Choose based on your needs: creative freedom vs convenience.
Stable Diffusion Architecture
The open-source model that democratized image generation
Three Components
1. Text Encoder (CLIP/T5): Converts your prompt into a sequence of embedding vectors. SDXL uses dual encoders (CLIP ViT-L + OpenCLIP ViT-bigG) for richer text understanding.

2. U-Net / DiT (Denoiser): The core diffusion model. Takes noisy latents + text embeddings, predicts the noise to remove. Cross-attention layers connect text to image generation.

3. VAE (Decoder): Converts the denoised latent representation back into a full-resolution pixel image. Pre-trained and frozen during diffusion training.
The Generation Pipeline
// Stable Diffusion XL generation
Input: "a cyberpunk city at night, neon lights"
Seed: 42 (for reproducibility)
1. CLIP encodes prompt → 77×2048 embeddings
2. Sample random noise in latent space (128×128)
3. Denoise for 30 steps (DPM-Solver++)
   - Each step: U-Net predicts noise
   - Cross-attention reads text embeddings
   - CFG scale 7.5 amplifies text influence
4. VAE decodes latent → 1024×1024 image
Time: ~3 seconds on RTX 4090
VRAM: ~6 GB (fp16)
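The CFG step in the pipeline is a one-line formula. Here is a minimal plain-Python sketch of classifier-free guidance (not the diffusers implementation); `eps_uncond` and `eps_cond` stand in for the U-Net's noise predictions without and with the text prompt:

```python
def cfg_noise(eps_uncond, eps_cond, guidance_scale=7.5):
    """Classifier-free guidance: start from the unconditional noise
    prediction and push further along the text-conditioned direction.
    Scale 1.0 means no amplification; ~7.5 is a common default."""
    return [u + guidance_scale * (c - u)
            for u, c in zip(eps_uncond, eps_cond)]

eps_u = [0.0, 1.0]  # prediction with an empty prompt
eps_c = [1.0, 1.0]  # prediction with the user's prompt
assert cfg_noise(eps_u, eps_c, 7.5) == [7.5, 1.0]  # text influence amplified
assert cfg_noise(eps_u, eps_c, 1.0) == [1.0, 1.0]  # no amplification
```

Higher guidance scales follow the prompt more literally at the cost of diversity and, at extreme values, image quality.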
Key insight: The seed controls the initial random noise. Same prompt + same seed = same image. This is crucial for reproducibility, iteration, and controlled experiments. Change the seed to explore variations; keep it fixed to refine a specific composition.
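A toy sketch of why this works: the initial latents are a deterministic function of the seed, so fixing the seed fixes the starting point of the whole denoising trajectory (plain Python, toy latent shape; real SDXL latents are 4×128×128):

```python
import random

def initial_latents(seed, shape=(4, 8, 8)):
    """Draw the starting Gaussian noise deterministically from a seed.
    Toy shape for illustration; SDXL uses 4x128x128 latents."""
    rng = random.Random(seed)
    n = shape[0] * shape[1] * shape[2]
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

assert initial_latents(42) == initial_latents(42)  # same seed, same noise
assert initial_latents(42) != initial_latents(43)  # new seed, new variation
```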
ControlNet & Conditioning
Adding spatial control beyond text prompts
The Problem
Text prompts alone can’t precisely control spatial layout. “A person standing on the left with a dog on the right” is ambiguous. ControlNet adds structural conditioning — you provide a reference image (edges, pose, depth) that guides the spatial arrangement.
ControlNet Types
Canny Edge: Preserves the edge structure of a reference image
OpenPose: Matches human body poses from a skeleton
Depth: Maintains the 3D depth layout of a scene
Scribble: Follows rough hand-drawn sketches
Segmentation: Fills in a semantic segmentation map
Normal Map: Preserves surface orientation for 3D-like control
How ControlNet Works
ControlNet creates a trainable copy of the U-Net’s encoder blocks. The control image (edges, pose, etc.) is processed by this copy, and its outputs are added to the original U-Net’s skip connections. This injects spatial information without modifying the base model.
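The injection itself is just addition. A toy sketch (Python lists standing in for feature tensors, not the real ControlNet code) of how the trainable copy's per-block residuals are summed into the frozen U-Net's skip connections:

```python
def inject_control(skips, control_residuals):
    """Add the ControlNet copy's output for each encoder block to the
    matching frozen U-Net skip feature before the decoder reads it."""
    assert len(skips) == len(control_residuals)
    return [[s + r for s, r in zip(skip, res)]
            for skip, res in zip(skips, control_residuals)]

skips   = [[1.0, 2.0], [3.0, 4.0]]    # frozen U-Net skip features
control = [[0.5, 0.25], [0.0, -0.5]]  # trainable-copy residuals
assert inject_control(skips, control) == [[1.5, 2.25], [3.0, 3.5]]
```

Because the base weights never change, the same ControlNet can be reused with any checkpoint derived from the same base model.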
IP-Adapter: Style Transfer
IP-Adapter is like ControlNet for style: provide a reference image and the model generates new images in that visual style. It works by injecting image embeddings (from CLIP’s image encoder) into the cross-attention layers alongside text embeddings.
Key insight: ControlNet transformed text-to-image from a “prompt lottery” into a precision tool. Professional workflows now combine text prompts for content with ControlNet for layout, enabling consistent, controllable generation.
Inpainting, img2img & Outpainting
Editing existing images with diffusion
img2img: Image-to-Image
Instead of starting from pure noise, start from a partially noised version of an existing image. The “denoising strength” (0.0–1.0) controls how much the original is preserved:

0.2: Subtle changes — color correction, minor style shift
0.5: Moderate changes — new elements while keeping composition
0.8: Major changes — mostly new image inspired by original
1.0: Complete regeneration (same as txt2img)
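The strength values above map to a starting point in the noise schedule. A sketch of the mapping diffusers-style pipelines typically use (assumption: a simple linear mapping from strength to step count):

```python
def img2img_schedule(num_inference_steps, strength):
    """Strength decides how deep into the noise schedule the image is
    injected: only the last `strength` fraction of the steps run."""
    steps_to_run = int(num_inference_steps * strength)
    start_step = num_inference_steps - steps_to_run
    return start_step, steps_to_run

assert img2img_schedule(30, 1.0) == (0, 30)   # pure txt2img
assert img2img_schedule(30, 0.5) == (15, 15)  # keep composition
assert img2img_schedule(30, 0.2) == (24, 6)   # subtle edit
```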
Inpainting
Inpainting regenerates only a masked region of an image while keeping the rest intact. Use cases: remove objects, replace backgrounds, fix artifacts, add elements. The model sees the unmasked context and generates content that blends seamlessly.
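Mechanically, many inpainting pipelines blend latents at every denoising step, which is how the seams stay invisible. A toy sketch (lists as latents; mask 1 = regenerate, 0 = keep):

```python
def blend_step(generated, original_noised, mask):
    """Keep freshly denoised content inside the mask and the
    (appropriately noised) original outside it, so the boundary
    blends seamlessly at every step."""
    return [m * g + (1 - m) * o
            for g, o, m in zip(generated, original_noised, mask)]

gen  = [9.0, 9.0, 9.0, 9.0]  # newly denoised latents
orig = [1.0, 2.0, 3.0, 4.0]  # original image, noised to this step
mask = [0, 0, 1, 1]          # regenerate only the right half
assert blend_step(gen, orig, mask) == [1.0, 2.0, 9.0, 9.0]
```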
Outpainting
Outpainting extends an image beyond its original borders. The model sees the existing image as context and generates new content that continues the scene naturally. Used to change aspect ratios, expand compositions, or create panoramic views.
DALL-E 3’s Approach
DALL-E 3 takes a unique approach: it uses GPT-4 to rewrite your prompt before generating. Your short prompt becomes a detailed, optimized description. This dramatically improves output quality but reduces user control over exact wording.
Key insight: The real power of text-to-image isn’t single-shot generation — it’s the iterative workflow: generate → select → inpaint → refine → upscale. Professional users rarely accept the first output; they iterate using these tools.
Prompt Engineering for Images
Writing prompts that produce what you want
Prompt Structure
// Effective prompt formula
[Subject] + [Action/Pose] + [Setting] + [Style] + [Lighting] + [Quality]

// Example:
"A samurai warrior standing on a cliff overlooking a misty valley at dawn, cinematic lighting, volumetric fog, 8K, highly detailed, artstation"

// Negative prompt (what to avoid):
"blurry, low quality, deformed hands, extra fingers, watermark, text"
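The formula can be wrapped in a tiny helper so prompts stay consistent across experiments. This is a hypothetical convenience function, not part of any library:

```python
def build_prompt(subject, action="", setting="", style="",
                 lighting="", quality=()):
    """Assemble [Subject]+[Action]+[Setting]+[Style]+[Lighting]+[Quality]
    into a comma-separated keyword prompt, skipping blank slots."""
    parts = [subject, action, setting, style, lighting, *quality]
    return ", ".join(p for p in parts if p)

p = build_prompt("a samurai warrior", "standing on a cliff",
                 "misty valley at dawn", "cinematic lighting",
                 "volumetric fog", ("8K", "highly detailed"))
assert p == ("a samurai warrior, standing on a cliff, "
             "misty valley at dawn, cinematic lighting, "
             "volumetric fog, 8K, highly detailed")
```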
Key Techniques
Be specific: “golden retriever puppy” not “dog”
Describe style: “oil painting”, “photograph”, “watercolor”, “3D render”
Specify lighting: “dramatic side lighting”, “soft diffused light”, “golden hour”
Add quality tokens: “masterpiece”, “highly detailed”, “8K” (model-dependent)
Use negative prompts: Exclude unwanted elements explicitly
Weight tokens: (important concept:1.5) increases emphasis in SD
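A simplified sketch of how the `(token:weight)` syntax is parsed; real community UIs (e.g. AUTOMATIC1111) also handle nested parentheses and bare-paren shorthand, which this toy parser skips:

```python
import re

WEIGHT = re.compile(r"\(([^():]+):([0-9.]+)\)")

def parse_weights(prompt):
    """Extract SD-style '(token:weight)' emphasis; everything else
    defaults to weight 1.0. Simplified illustration only."""
    weights = {}
    rest = prompt
    for m in WEIGHT.finditer(prompt):
        weights[m.group(1)] = float(m.group(2))
        rest = rest.replace(m.group(0), "")
    for token in filter(None, (t.strip() for t in rest.split(","))):
        weights.setdefault(token, 1.0)
    return weights

w = parse_weights("a castle, (dramatic sky:1.5), fog")
assert w == {"dramatic sky": 1.5, "a castle": 1.0, "fog": 1.0}
```

Downstream, these weights scale the corresponding text embeddings before cross-attention, which is what increases or decreases a concept's influence.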
Key insight: Prompt engineering for images is different from text LLMs. Image models respond to descriptive keywords more than instructions. “A photo of a cat” works better than “Generate me a picture of a cat please.” Think art direction, not conversation.
LoRA & Fine-Tuning
Customizing models for specific styles, characters, and concepts
What Is LoRA?
Low-Rank Adaptation (LoRA) adds small trainable matrices to the model’s attention layers. Instead of fine-tuning all 3.5B parameters of SDXL, you train ~10–100M additional parameters. The LoRA file is tiny (10–200MB) and can be loaded/unloaded at inference time.
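The savings follow directly from the factorization W' = W + (alpha/r) * B @ A, where only A (r × d_in) and B (d_out × r) are trained. Counting parameters for one 4096×4096 attention projection (illustrative dimensions):

```python
def lora_params(d_in, d_out, r):
    """Compare full fine-tuning of a d_out x d_in weight matrix
    against training two low-rank factors B (d_out x r) and A (r x d_in)."""
    full = d_out * d_in
    lora = d_out * r + r * d_in
    return full, lora

full, lora = lora_params(4096, 4096, r=8)
assert full == 16_777_216
assert lora == 65_536  # ~0.4% of the full matrix
```

Summed over every adapted attention layer, this is why a LoRA file is tens of megabytes while the base checkpoint is gigabytes.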
Common LoRA Use Cases
Character consistency: Train on 20–50 images of a character to generate them in any pose/setting
Art style: Train on an artist’s work to replicate their style
Product photography: Train on product images for consistent branding
Concept injection: Teach the model new concepts it wasn’t trained on
Training a LoRA
// Typical LoRA training setup
Images: 20-50 high-quality images
Captions: Auto-generated or manual
Steps: 1,000-3,000 training steps
Time: 15-60 minutes on consumer GPU
Output: 10-200 MB LoRA file
Trigger: "sks style" or custom token
// Can combine multiple LoRAs at inference
// e.g., character LoRA + style LoRA
Key insight: LoRAs are the “plugins” of the image generation ecosystem. CivitAI hosts 100,000+ community LoRAs. They enable personalization without retraining the base model — and multiple LoRAs can be combined at inference time for layered customization.
The Professional Creative Workflow
How professionals actually use text-to-image tools
The Iterative Loop
// Professional workflow (not one-shot!)
1. Explore: Generate 4-8 variations with different seeds; identify promising compositions
2. Refine: img2img on the best candidate (strength 0.3-0.5); adjust prompt, add/remove details
3. Fix: Inpaint problem areas (hands, faces, text); use ControlNet for precise corrections
4. Upscale: 4x upscale with Real-ESRGAN or tiled diffusion; add fine detail at higher resolution
5. Post-process: Color correction, compositing in Photoshop; final touches for production use
Production Applications
Concept art: Game and film studios use AI for rapid ideation (10x faster than manual)
Marketing: A/B test ad creatives at scale — generate 100 variations in minutes
E-commerce: Product photography without physical shoots
Architecture: Visualize building designs from floor plans
Fashion: Virtual try-on and collection visualization
Publishing: Book covers, article illustrations, social media content
Key insight: Text-to-image AI doesn’t replace artists — it changes their workflow. The most effective users are those who combine AI generation with traditional skills: composition, color theory, and iterative refinement. AI is a power tool, not an autopilot.
Key Takeaways
What to remember about text-to-image generation
Essential Concepts
1. Architecture: Text encoder (CLIP/T5) + Denoiser (U-Net/DiT) + VAE decoder

2. Open vs Closed: SD/Flux for control and customization; DALL-E/Midjourney for convenience

3. ControlNet: Adds spatial control via edges, poses, depth maps — transforms generation into a precision tool

4. Inpainting/img2img: Edit existing images; the real workflow is iterative, not one-shot

5. LoRA: Lightweight fine-tuning for custom styles, characters, and concepts (10–200MB files)
Choosing a Model
Need customization? → Stable Diffusion / Flux (open, LoRA ecosystem)
Need text in images? → DALL-E 3 or Flux (best text rendering)
Need aesthetics? → Midjourney (best artistic quality)
Need speed? → SDXL Turbo / LCM (real-time)
Need safety? → DALL-E 3 (built-in content filters)
Next up: Chapter 7 extends generation to the temporal dimension — text-to-video with Sora’s Diffusion Transformer, spacetime patches, temporal coherence, and the hard limits of current video generation.