r/LocalLLaMA
Step-3.7-Flash (198B-A11B vision MoE) on 4×3090: fully-resident IQ3_XXS beats the spilled IQ4 by 2.4×, and MTP speculative decode silently breaks vision
TL;DR: The biggest quant that fits 100% in VRAM beats a higher quant that spills. And if your MoE ships an MTP draft head, the draft can't decode image tokens, so MTP and vision don't coexist. Context: solo dev running a local vision+tools+reasoning daily driver for mega projects, enterprise plugins, and custom agents. Hardware: 4×3090, no NVLink.
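
For anyone who wants to sanity-check the "fits vs. spills" part before downloading anything, here is a back-of-envelope sketch. It is not a measurement. Assumptions: the bits-per-weight figures are the nominal ones llama.cpp lists for each quant type (real GGUF files mix tensor types, so actual sizes differ), "IQ4" is taken to mean IQ4_XS, and `reserve_gib_per_gpu` is a made-up placeholder for KV cache, vision encoder, and compute buffers.

```python
# Back-of-envelope check: do the quantized weights fit entirely in VRAM?
# All figures are nominal estimates, not measured file sizes.

GIB = 1024 ** 3


def weight_gib(params_b: float, bpw: float) -> float:
    """Approximate weight size in GiB for params_b billion parameters at bpw bits per weight."""
    return params_b * 1e9 * bpw / 8 / GIB


def fits(params_b: float, bpw: float, n_gpus: int,
         gib_per_gpu: float, reserve_gib_per_gpu: float) -> bool:
    """True if the weights fit in the VRAM left after the per-GPU reserve."""
    usable = n_gpus * (gib_per_gpu - reserve_gib_per_gpu)
    return weight_gib(params_b, bpw) <= usable


if __name__ == "__main__":
    # Nominal bpw per llama.cpp quant type; IQ4_XS is an assumed stand-in for "IQ4".
    quants = {"IQ3_XXS": 3.06, "IQ4_XS": 4.25}
    for name, bpw in quants.items():
        size = weight_gib(198, bpw)
        # 3 GiB per-GPU reserve is a hypothetical placeholder, tune for your context length.
        ok = fits(198, bpw, n_gpus=4, gib_per_gpu=24, reserve_gib_per_gpu=3)
        print(f"{name}: ~{size:.1f} GiB, fully resident on 4x24 GiB: {ok}")
```

Under these assumptions IQ3_XXS comes out around 70 GiB and fits with room for cache, while IQ4_XS comes out around 98 GiB and exceeds the 96 GiB total even with zero reserve, which is consistent with the spill described above.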