arXiv cs.CV June 25, 2026 · Papers

MJEPA: A Simple and Scalable Joint-Embedding Predictive Architecture for Audio-Visual Learning

arXiv:2606.25225v1. Abstract: Self-supervised learning from large-scale video data has emerged as a dominant paradigm for visual representation learning. Because audio and visual streams co-occur naturally in video, extending this success to learn jointly from both modalities is a logical next step.
