arXiv cs.CV · Papers

Listening makes Vision Clear for VLMs

arXiv:2606.23763v1 Announce Type: new Abstract: Recent work typically assesses vision-language consistency using the attention distributions of answer-side tokens. However, we observe that the highest-attention regions are not always consistent with the intended semantic token. This probably stems from decoding drift, where l