arXiv cs.AI
COMPASS: Grounding Composition-Intent Guidance in Unified Multimodal Models
arXiv:2606.28696v1 Announce Type: new
Abstract: Composition is a high-level visual intent that governs where subjects are placed and how a scene is organized, yet current unified multimodal models remain unreliable at fine-grained composition recognition and struggle to turn such intent into controllable generation. We