r/MachineLearning
DeepSWE: new benchmark looking at how well today’s frontier models can actually write code [R]
DeepSWE delivers four advances over existing public benchmarks:

Contamination free: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining.

High diversity: Tasks span a broad pool of 91 repositories across 5 languages.

Real-world complexity: Prompt