arXiv cs.LG
Reconstruction Alignment Improves Unified Multimodal Models
arXiv:2509.07295v4
Announce Type: replace-cross

Abstract: Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details, even when