IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation

Unified multi-modal large language models (MLLMs) have achieved strong text-to-image generation quality, but still struggle with structure-aware prompt following, where object counts, spatial relations, attribute bindings, and coarse layouts must be preserved. We attribute this limitation in part to