arXiv cs.LG · Papers

Reconstruction Alignment Improves Unified Multimodal Models

arXiv:2509.07295v4 · Announce Type: replace-cross

Abstract: Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details, even when