arXiv cs.LG
Event-Grounded Question Answering over Long Audio via Structured Retrieval
arXiv:2602.14612v4
Abstract: Answering natural-language questions over multi-hour audio requires both event recognition and temporal grounding. Current large audio-language models perform well on short clips but are limited by context length, query-time cost, and weak temporal localization