Qwen3 benchmark results
Benchmark results for Qwen3 models using the Aider polyglot coding benchmark.
The $6.32 benchmark cost reported for Gemini 2.5 Pro Preview 03-25 was incorrect.
DeepSeek's API has been experiencing reliability issues. Here are alternative providers you can use.
R1+Sonnet has set a new SOTA on the aider polyglot benchmark, at 14X less cost than o1.
Reliably packaging & distributing python CLI tools is hard. Aider uses uv in novel ways to make it easy to install the aider CLI, its dependencies…
o1 achieves the top score on aider's new, more challenging, multi-language coding benchmark.
QwQ is a reasoning model like o1, and needs to be used as an architect, with another model serving as editor.
Open source LLMs are becoming very powerful, but pay attention to how you (or your provider) are serving the model. It can affect its code editing skill.
An Architect model describes how to solve the coding problem, and an Editor model translates that into file edits. This Architect/Editor approach produces SOTA benchmark results.
Preliminary benchmark results for the new OpenAI o1 models.