arXiv cs.AI · Papers

COMPASS: Grounding Composition-Intent Guidance in Unified Multimodal Models

arXiv:2606.28696v1 Announce Type: new

Abstract: Composition is a high-level visual intent that governs where subjects are placed and how a scene is organized, yet current unified multimodal models remain unreliable at fine-grained composition recognition and struggle to turn such intent into controllable generation. We