arXiv cs.CV
Staying VIGILant: Mitigating Visual Laziness via Counterfactual Visual Alignment in MLLMs
arXiv:2606.26387v1 (Announce Type: new)
Abstract: Multimodal large language models (MLLMs) extend large language models (LLMs) with visual perception, enabling joint reasoning over images and text. Despite inheriting strong reasoning capabilities from LLMs, they remain prone to hallucinations that contradict their visual