r/LocalLLaMA · June 26, 2026 · Communities

We open-sourced a harness for evaluating VLMs on your own video, with traced runs

The framing that finally made VLM evaluation tractable for us is simple: decide what setup is right for your task, on your videos, at the quality, latency, and cost you can actually support. Once we framed it that way, the work changed. We stopped reading leaderboards and started building small eval sets from productio
