Common Application Patterns
// Pattern 1: Document Understanding
Input: [invoice image] + "Extract all line items"
Output: Structured JSON with items, prices, totals
// Pattern 2: Visual QA Pipeline
Input: [product photo] + "Any defects visible?"
Output: "Scratch on upper-left corner, ~2cm long"
// Pattern 3: Multi-Image Comparison
Input: [before] + [after] + "What changed?"
Output: Detailed diff of visual changes
// Pattern 4: Screenshot to Code
Input: [UI mockup] + "Generate React component"
Output: Working JSX matching the design
Best Practices
• Be specific in prompts: “Count the red cars in the parking lot” not “What do you see?”
• Use system prompts: Define the model’s role and expected output format
• Provide examples: Few-shot with example image-response pairs
• Chain multiple calls: First identify regions of interest, then analyze each
• Validate outputs: Cross-check extracted data against known constraints
Pro tip: For structured extraction (invoices, forms, tables), ask the VLM to output JSON and validate the schema. For subjective tasks (quality inspection, content moderation), use confidence scores and human review for edge cases.