arXiv cs.CV · Papers

Staying VIGILant: Mitigating Visual Laziness via Counterfactual Visual Alignment in MLLMs

arXiv:2606.26387v1 · Announce Type: new

Abstract: Multimodal large language models (MLLMs) extend large language models (LLMs) with visual perception, enabling joint reasoning over images and text. Despite inheriting strong reasoning capabilities from LLMs, they remain prone to hallucinations that contradict their visual