VLM Understanding Metrics
• Accuracy: Fraction of correct answers on visual Q&A benchmarks (multiple choice or exact match)
• BLEU/ROUGE/CIDEr: Text similarity between generated captions and reference captions. Useful but limited: they reward n-gram overlap and can penalize valid paraphrases.
• CLIPScore: CLIP similarity between generated text and image. Measures text-image alignment without reference.
• Hallucination rate: % of responses containing objects/facts not present in the image
• Spatial accuracy: Correct identification of positions, sizes, and relationships
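Reference-free CLIPScore can be sketched in a few lines once image and text embeddings are available. The sketch below assumes the embeddings have already been produced by a CLIP-style model (obtaining them is outside this snippet); the 2.5 rescaling factor follows the original CLIPScore formulation.

```python
import numpy as np

def clip_score(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Reference-free CLIPScore: 2.5 * max(cosine similarity, 0).

    Assumes `image_emb` and `text_emb` come from the same CLIP-style
    model's image and text encoders (embedding extraction not shown).
    """
    cos = float(
        np.dot(image_emb, text_emb)
        / (np.linalg.norm(image_emb) * np.linalg.norm(text_emb))
    )
    # Negative similarities are clipped to 0 so the score stays in [0, 2.5].
    return 2.5 * max(cos, 0.0)
```

Because no reference captions are needed, this works for open-ended generation where collecting references is impractical.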
Generation Metrics
• FID (Fréchet Inception Distance): Measures quality and diversity of generated images vs. real images. Lower is better.
• CLIPScore: How well does the generated image match the text prompt?
• Aesthetic score: Predicted human preference for visual quality
• IS (Inception Score): Quality and diversity of generated images
• Human preference: Side-by-side comparison rated by humans (gold standard)
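FID reduces to a closed-form Fréchet distance between two Gaussians fitted to Inception features. A minimal sketch, assuming the mean and covariance of Inception-v3 pool features have already been computed for the real and generated sets (feature extraction not shown):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(mu1: np.ndarray, sigma1: np.ndarray,
        mu2: np.ndarray, sigma2: np.ndarray) -> float:
    """Fréchet Inception Distance between two feature Gaussians:
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * sqrt(sigma1 @ sigma2)).
    """
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        # sqrtm can return tiny imaginary parts from numerical noise.
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Identical distributions give FID 0; larger values mean the generated feature distribution drifts further from the real one, which is why lower is better.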
Key insight: Automated metrics (FID, CLIPScore) are useful for rapid iteration but don’t fully capture human preferences. Use automated metrics for daily monitoring, and always validate with human evaluation for milestones and high-stakes decisions.