r/MachineLearning June 24, 2026 · Communities

DeepSWE: new benchmark looking at how well today’s frontier models can actually write code [R]

DeepSWE delivers four advances over existing public benchmarks:

- Contamination free: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining.
- High diversity: Tasks span a broad pool of 91 repositories across 5 languages.
- Real-world complexity: Prompt
