Ch 6 — Text-to-Image Generation

Stable Diffusion, DALL-E 3, Midjourney, Flux — ControlNet, inpainting, and the creative workflow
High-Level Pipeline: Prompt → Encode → Diffuse → Control → Output → Create
The Text-to-Image Landscape
Major models and their architectures
Model Comparison
// Major text-to-image models (2025)
Stable Diffusion XL: open-source, U-Net denoiser; text: dual CLIP + OpenCLIP; res: 1024×1024; large community ecosystem
Flux (Black Forest): open-weight, DiT; text: T5-XXL + CLIP, Flow Matching; res: up to 2048×2048; state-of-the-art open model
DALL-E 3 (OpenAI): closed, proprietary; GPT-4 rewrites prompts internally; res: 1024×1024; strong text rendering
Midjourney v6: closed, proprietary; aesthetic focus, Discord-based; res: up to 2048×2048; best aesthetics
Imagen 3 (Google): closed, proprietary; text: T5-XXL encoder; res: 1024×1024; strong photorealism
Open vs Closed
Open-source (SD, Flux): Run locally, full control, fine-tunable, community LoRAs and extensions, no content restrictions, no API costs
Closed (DALL-E, Midjourney): Higher quality out-of-box, no GPU needed, content safety filters, pay-per-image, no customization

The gap is closing rapidly. Flux rivals DALL-E 3 quality while being open-weight.
Key insight: The text-to-image landscape is bifurcating: closed models optimize for safety and ease-of-use (DALL-E, Midjourney), while open models optimize for control and customization (SD, Flux). Choose based on your needs: creative freedom vs convenience.
Stable Diffusion Architecture
The open-source model that democratized image generation
Three Components
1. Text Encoder (CLIP/T5): Converts your prompt into a sequence of embedding vectors. SDXL uses dual encoders (CLIP ViT-L + OpenCLIP ViT-bigG) for richer text understanding.

2. U-Net / DiT (Denoiser): The core diffusion model. Takes noisy latents + text embeddings, predicts the noise to remove. Cross-attention layers connect text to image generation.

3. VAE (Decoder): Converts the denoised latent representation back into a full-resolution pixel image. Pre-trained and frozen during diffusion training.
The Generation Pipeline
// Stable Diffusion XL generation
Input: "a cyberpunk city at night, neon lights"
Seed: 42 (for reproducibility)
1. CLIP encodes prompt → 77×2048 embeddings
2. Sample random noise in latent space (128×128)
3. Denoise for 30 steps (DPM-Solver++)
   - Each step: U-Net predicts noise
   - Cross-attention reads text embeddings
   - CFG scale 7.5 amplifies text influence
4. VAE decodes latent → 1024×1024 image
Time: ~3 seconds on RTX 4090
VRAM: ~6 GB (fp16)
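The CFG step in the pipeline is a one-line formula. Here is a minimal plain-Python sketch of classifier-free guidance (not the diffusers implementation); `eps_uncond` and `eps_cond` stand in for the U-Net's noise predictions without and with the text prompt:

```python
def cfg_noise(eps_uncond, eps_cond, guidance_scale=7.5):
    """Classifier-free guidance: start from the unconditional noise
    prediction and push further along the text-conditioned direction.
    Scale 1.0 means no amplification; ~7.5 is a common default."""
    return [u + guidance_scale * (c - u)
            for u, c in zip(eps_uncond, eps_cond)]

eps_u = [0.0, 1.0]  # prediction with an empty prompt
eps_c = [1.0, 1.0]  # prediction with the user's prompt
assert cfg_noise(eps_u, eps_c, 7.5) == [7.5, 1.0]  # text influence amplified
assert cfg_noise(eps_u, eps_c, 1.0) == [1.0, 1.0]  # no amplification
```

Higher guidance scales follow the prompt more literally at the cost of diversity and, at extreme values, image quality.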
Key insight: The seed controls the initial random noise. Same prompt + same seed = same image. This is crucial for reproducibility, iteration, and controlled experiments. Change the seed to explore variations; keep it fixed to refine a specific composition.
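A toy sketch of why this works: the initial latents are a deterministic function of the seed, so fixing the seed fixes the starting point of the whole denoising trajectory (plain Python, toy latent shape; real SDXL latents are 4×128×128):

```python
import random

def initial_latents(seed, shape=(4, 8, 8)):
    """Draw the starting Gaussian noise deterministically from a seed.
    Toy shape for illustration; SDXL uses 4x128x128 latents."""
    rng = random.Random(seed)
    n = shape[0] * shape[1] * shape[2]
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

assert initial_latents(42) == initial_latents(42)  # same seed, same noise
assert initial_latents(42) != initial_latents(43)  # new seed, new variation
```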
ControlNet & Conditioning
Adding spatial control beyond text prompts
The Problem
Text prompts alone can’t precisely control spatial layout. “A person standing on the left with a dog on the right” is ambiguous. ControlNet adds structural conditioning — you provide a reference image (edges, pose, depth) that guides the spatial arrangement.
ControlNet Types
Canny Edge: Preserves the edge structure of a reference image
OpenPose: Matches human body poses from a skeleton
Depth: Maintains the 3D depth layout of a scene
Scribble: Follows rough hand-drawn sketches
Segmentation: Fills in a semantic segmentation map
Normal Map: Preserves surface orientation for 3D-like control
How ControlNet Works
ControlNet creates a trainable copy of the U-Net’s encoder blocks. The control image (edges, pose, etc.) is processed by this copy, and its outputs are added to the original U-Net’s skip connections. This injects spatial information without modifying the base model.
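The injection itself is just addition. A toy sketch (Python lists standing in for feature tensors, not the real ControlNet code) of how the trainable copy's per-block residuals are summed into the frozen U-Net's skip connections:

```python
def inject_control(skips, control_residuals):
    """Add the ControlNet copy's output for each encoder block to the
    matching frozen U-Net skip feature before the decoder reads it."""
    assert len(skips) == len(control_residuals)
    return [[s + r for s, r in zip(skip, res)]
            for skip, res in zip(skips, control_residuals)]

skips   = [[1.0, 2.0], [3.0, 4.0]]    # frozen U-Net skip features
control = [[0.5, 0.25], [0.0, -0.5]]  # trainable-copy residuals
assert inject_control(skips, control) == [[1.5, 2.25], [3.0, 3.5]]
```

Because the base weights never change, the same ControlNet can be reused with any checkpoint derived from the same base model.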
IP-Adapter: Style Transfer
IP-Adapter is like ControlNet for style: provide a reference image and the model generates new images in that visual style. It works by injecting image embeddings (from CLIP’s image encoder) into the cross-attention layers alongside text embeddings.
Key insight: ControlNet transformed text-to-image from a “prompt lottery” into a precision tool. Professional workflows now combine text prompts for content with ControlNet for layout, enabling consistent, controllable generation.
Inpainting, img2img & Outpainting
Editing existing images with diffusion
img2img: Image-to-Image
Instead of starting from pure noise, start from a partially noised version of an existing image. The “denoising strength” (0.0–1.0) controls how much the original is preserved:

0.2: Subtle changes — color correction, minor style shift
0.5: Moderate changes — new elements while keeping composition
0.8: Major changes — mostly new image inspired by original
1.0: Complete regeneration (same as txt2img)
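The strength values above map to a starting point in the noise schedule. A sketch of the mapping diffusers-style pipelines typically use (assumption: a simple linear mapping from strength to step count):

```python
def img2img_schedule(num_inference_steps, strength):
    """Strength decides how deep into the noise schedule the image is
    injected: only the last `strength` fraction of the steps run."""
    steps_to_run = int(num_inference_steps * strength)
    start_step = num_inference_steps - steps_to_run
    return start_step, steps_to_run

assert img2img_schedule(30, 1.0) == (0, 30)   # pure txt2img
assert img2img_schedule(30, 0.5) == (15, 15)  # keep composition
assert img2img_schedule(30, 0.2) == (24, 6)   # subtle edit
```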
Inpainting
Inpainting regenerates only a masked region of an image while keeping the rest intact. Use cases: remove objects, replace backgrounds, fix artifacts, add elements. The model sees the unmasked context and generates content that blends seamlessly.
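Mechanically, many inpainting pipelines blend latents at every denoising step, which is how the seams stay invisible. A toy sketch (lists as latents; mask 1 = regenerate, 0 = keep):

```python
def blend_step(generated, original_noised, mask):
    """Keep freshly denoised content inside the mask and the
    (appropriately noised) original outside it, so the boundary
    blends seamlessly at every step."""
    return [m * g + (1 - m) * o
            for g, o, m in zip(generated, original_noised, mask)]

gen  = [9.0, 9.0, 9.0, 9.0]  # newly denoised latents
orig = [1.0, 2.0, 3.0, 4.0]  # original image, noised to this step
mask = [0, 0, 1, 1]          # regenerate only the right half
assert blend_step(gen, orig, mask) == [1.0, 2.0, 9.0, 9.0]
```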
Outpainting
Outpainting extends an image beyond its original borders. The model sees the existing image as context and generates new content that continues the scene naturally. Used to change aspect ratios, expand compositions, or create panoramic views.
DALL-E 3’s Approach
DALL-E 3 takes a unique approach: it uses GPT-4 to rewrite your prompt before generating. Your short prompt becomes a detailed, optimized description. This dramatically improves output quality but reduces user control over exact wording.
Key insight: The real power of text-to-image isn’t single-shot generation — it’s the iterative workflow: generate → select → inpaint → refine → upscale. Professional users rarely accept the first output; they iterate using these tools.
Prompt Engineering for Images
Writing prompts that produce what you want
Prompt Structure
// Effective prompt formula
[Subject] + [Action/Pose] + [Setting] + [Style] + [Lighting] + [Quality]

// Example:
"A samurai warrior standing on a cliff overlooking a misty valley at dawn, cinematic lighting, volumetric fog, 8K, highly detailed, artstation"

// Negative prompt (what to avoid):
"blurry, low quality, deformed hands, extra fingers, watermark, text"
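The formula can be wrapped in a tiny helper so prompts stay consistent across experiments. This is a hypothetical convenience function, not part of any library:

```python
def build_prompt(subject, action="", setting="", style="",
                 lighting="", quality=()):
    """Assemble [Subject]+[Action]+[Setting]+[Style]+[Lighting]+[Quality]
    into a comma-separated keyword prompt, skipping blank slots."""
    parts = [subject, action, setting, style, lighting, *quality]
    return ", ".join(p for p in parts if p)

p = build_prompt("a samurai warrior", "standing on a cliff",
                 "misty valley at dawn", "cinematic lighting",
                 "volumetric fog", ("8K", "highly detailed"))
assert p == ("a samurai warrior, standing on a cliff, "
             "misty valley at dawn, cinematic lighting, "
             "volumetric fog, 8K, highly detailed")
```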
Key Techniques
Be specific: “golden retriever puppy” not “dog”
Describe style: “oil painting”, “photograph”, “watercolor”, “3D render”
Specify lighting: “dramatic side lighting”, “soft diffused light”, “golden hour”
Add quality tokens: “masterpiece”, “highly detailed”, “8K” (model-dependent)
Use negative prompts: Exclude unwanted elements explicitly
Weight tokens: (important concept:1.5) increases emphasis in SD
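A simplified sketch of how the `(token:weight)` syntax is parsed; real community UIs (e.g. AUTOMATIC1111) also handle nested parentheses and bare-paren shorthand, which this toy parser skips:

```python
import re

WEIGHT = re.compile(r"\(([^():]+):([0-9.]+)\)")

def parse_weights(prompt):
    """Extract SD-style '(token:weight)' emphasis; everything else
    defaults to weight 1.0. Simplified illustration only."""
    weights = {}
    rest = prompt
    for m in WEIGHT.finditer(prompt):
        weights[m.group(1)] = float(m.group(2))
        rest = rest.replace(m.group(0), "")
    for token in filter(None, (t.strip() for t in rest.split(","))):
        weights.setdefault(token, 1.0)
    return weights

w = parse_weights("a castle, (dramatic sky:1.5), fog")
assert w == {"dramatic sky": 1.5, "a castle": 1.0, "fog": 1.0}
```

Downstream, these weights scale the corresponding text embeddings before cross-attention, which is what increases or decreases a concept's influence.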
Key insight: Prompt engineering for images is different from text LLMs. Image models respond to descriptive keywords more than instructions. “A photo of a cat” works better than “Generate me a picture of a cat please.” Think art direction, not conversation.
LoRA & Fine-Tuning
Customizing models for specific styles, characters, and concepts
What Is LoRA?
Low-Rank Adaptation (LoRA) adds small trainable matrices to the model’s attention layers. Instead of fine-tuning all 3.5B parameters of SDXL, you train ~10–100M additional parameters. The LoRA file is tiny (10–200MB) and can be loaded/unloaded at inference time.
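The savings follow directly from the factorization W' = W + (alpha/r) * B @ A, where only A (r × d_in) and B (d_out × r) are trained. Counting parameters for one 4096×4096 attention projection (illustrative dimensions):

```python
def lora_params(d_in, d_out, r):
    """Compare full fine-tuning of a d_out x d_in weight matrix
    against training two low-rank factors B (d_out x r) and A (r x d_in)."""
    full = d_out * d_in
    lora = d_out * r + r * d_in
    return full, lora

full, lora = lora_params(4096, 4096, r=8)
assert full == 16_777_216
assert lora == 65_536  # ~0.4% of the full matrix
```

Summed over every adapted attention layer, this is why a LoRA file is tens of megabytes while the base checkpoint is gigabytes.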
Common LoRA Use Cases
Character consistency: Train on 20–50 images of a character to generate them in any pose/setting
Art style: Train on an artist’s work to replicate their style
Product photography: Train on product images for consistent branding
Concept injection: Teach the model new concepts it wasn’t trained on
Training a LoRA
// Typical LoRA training setup
Images: 20-50 high-quality images
Captions: Auto-generated or manual
Steps: 1,000-3,000 training steps
Time: 15-60 minutes on consumer GPU
Output: 10-200 MB LoRA file
Trigger: "sks style" or custom token
// Can combine multiple LoRAs at inference
// e.g., character LoRA + style LoRA
Key insight: LoRAs are the “plugins” of the image generation ecosystem. CivitAI hosts 100,000+ community LoRAs. They enable personalization without retraining the base model — and multiple LoRAs can be combined at inference time for layered customization.
The Professional Creative Workflow
How professionals actually use text-to-image tools
The Iterative Loop
// Professional workflow (not one-shot!)
1. Explore: Generate 4-8 variations with different seeds; identify promising compositions
2. Refine: img2img on the best candidate (strength 0.3-0.5); adjust prompt, add/remove details
3. Fix: Inpaint problem areas (hands, faces, text); use ControlNet for precise corrections
4. Upscale: 4x upscale with Real-ESRGAN or tiled diffusion; add fine detail at higher resolution
5. Post-process: Color correction, compositing in Photoshop; final touches for production use
Production Applications
Concept art: Game and film studios use AI for rapid ideation (10x faster than manual)
Marketing: A/B test ad creatives at scale — generate 100 variations in minutes
E-commerce: Product photography without physical shoots
Architecture: Visualize building designs from floor plans
Fashion: Virtual try-on and collection visualization
Publishing: Book covers, article illustrations, social media content
Key insight: Text-to-image AI doesn’t replace artists — it changes their workflow. The most effective users are those who combine AI generation with traditional skills: composition, color theory, and iterative refinement. AI is a power tool, not an autopilot.
Key Takeaways
What to remember about text-to-image generation
Essential Concepts
1. Architecture: Text encoder (CLIP/T5) + Denoiser (U-Net/DiT) + VAE decoder

2. Open vs Closed: SD/Flux for control and customization; DALL-E/Midjourney for convenience

3. ControlNet: Adds spatial control via edges, poses, depth maps — transforms generation into a precision tool

4. Inpainting/img2img: Edit existing images; the real workflow is iterative, not one-shot

5. LoRA: Lightweight fine-tuning for custom styles, characters, and concepts (10–200MB files)
Choosing a Model
Need customization? → Stable Diffusion / Flux (open, LoRA ecosystem)
Need text in images? → DALL-E 3 or Flux (best text rendering)
Need aesthetics? → Midjourney (best artistic quality)
Need speed? → SDXL Turbo / LCM (real-time)
Need safety? → DALL-E 3 (built-in content filters)
Next up: Chapter 7 extends generation to the temporal dimension — text-to-video with Sora’s Diffusion Transformer, spacetime patches, temporal coherence, and the hard limits of current video generation.