arXiv cs.CV
· Papers
Listening makes Vision Clear for VLMs
arXiv:2606.23763v1 Announce Type: new Abstract: Recent work typically assesses vision--language consistency using attention distributions of answer-side tokens. However, we observe that highest attention regions are not always consistent with the intended semantic token. This probably stems from decoding drift, where l