HF Daily Papers
IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation
Unified multi-modal large language models (MLLMs) have achieved strong text-to-image generation quality, but still struggle with structure-aware prompt following, where object counts, spatial relations, attribute bindings, and coarse layouts must be preserved. We attribute this limitation in part to