News from arXiv cs.CV

arXiv cs.CV Papers 16 hr ago

ABACUS: Adapting Unified Foundation Model for Bridging Image Count Understanding and Generation

arXiv:2606.23835v1 Announce Type: new Abstract: ABACUS is a unified vision-language model that handles object counting, crowd counting, referring-expression counting, and count-faithful image generation without any benchmark-specific…

arXiv cs.CV Papers 16 hr ago

From Spatial to Spectral: An Efficient, Frequency-Guided Feature Representation Learner for Small Object Detection

arXiv:2606.23825v1 Announce Type: new Abstract: Efficient small object detection is bottlenecked by the inherent feature scarcity of tiny targets, which is further aggravated by operations of…

arXiv cs.CV Papers 16 hr ago

Listening makes Vision Clear for VLMs

arXiv:2606.23763v1 Announce Type: new Abstract: Recent work typically assesses vision-language consistency using attention distributions of answer-side tokens. However, we observe that highest attention regions are not…

arXiv cs.CV Papers 16 hr ago

ArtiTwinSplat: Interactable Digital Twin Reconstruction via Gaussian Splatting from RGB-D videos

arXiv:2606.24628v1 Announce Type: cross Abstract: Deploying robots in unstructured real-world environments needs accurate, interactive models of the objects. Constructing these models at scale remains a critical…
