Skip to content
Source · Daily Brief

AI Daily Brief — 24 February 2025

Monday delivered Anthropic’s biggest model release since 3.5 Sonnet plus the year’s most consequential open-source infrastructure drop. Claude 3.7 Sonnet shipped as the industry’s first hybrid reasoning model: one model with near-instant responses or visible step-by-step extended thinking, with API users setting a thinking-token budget up to 128K to trade speed/cost for quality. Available across all Claude tiers, the Anthropic API, Amazon Bedrock, and Google Cloud Vertex AI. Same pricing as 3.5 Sonnet — $3 input / $15 output per million tokens (including thinking tokens). 70.3% SWE-bench Verified, ahead of OpenAI o1 and DeepSeek R1 at launch. 81.2% TAU-bench retail. 84.8% GPQA Diamond in extended thinking mode, 96.5% physics subscore. 45% fewer unnecessary refusals than its predecessor. Alongside the model, Anthropic launched Claude Code as a limited research preview — a CLI agentic terminal tool that delegates substantial engineering tasks (searching/editing code, running tests, committing) to Claude. DeepSeek opened its five-day Open Source Week with FlashMLA — a Hopper-optimized MLA decoding kernel for variable-length sequences, hitting up to 3,000 GB/s in memory-bound and 580 TFLOPS in compute-bound workloads on H800 SXM5. xAI rolled out Grok 3 voice mode broadly to Premium+ and SuperGrok subscribers.

Top stories

  • Anthropic launches Claude 3.7 Sonnet — first hybrid reasoning model. One model, two modes (near-instant or visible step-by-step extended thinking). API thinking-token budget up to 128K. Available across Free / Pro / Team / Enterprise, the Anthropic API, Amazon Bedrock, and Google Cloud Vertex AI. Same pricing as 3.5 Sonnet: $3 input / $15 output per million tokens. via Anthropic
  • Claude 3.7 Sonnet: 70.3% SWE-bench Verified — state of the art at launch. Ahead of OpenAI o1 and DeepSeek R1. TAU-bench: 81.2% retail / 58.4% airline (beating o1’s 73.5% / 54.2%). GPQA Diamond: 68.0% standard / 84.8% extended thinking, 96.5% physics subscore. via llm-stats
  • Anthropic ships Claude Code as limited research preview. CLI agentic terminal tool — delegates substantial engineering tasks (searching/editing code, running tests, committing) to Claude directly from the terminal. Early entrant in the coding-agent-in-terminal category that OpenAI and Google would follow. via Anthropic
  • Claude 3.7 system card: 45% drop in unnecessary refusals, extended-thinking safety work. Multi-turn red-teaming showed extended thinking helped the model make better safety decisions in complex cybersecurity and self-harm scenarios. via Anthropic system card
  • DeepSeek kicks off Open Source Week with FlashMLA. Hopper-optimized Multi-Head Latent Attention decoding kernel for variable-length sequences. BF16/FP16 with a paged KV cache (block size 64). Up to 3,000 GB/s memory-bound / 580 TFLOPS compute-bound on H800 SXM5 — roughly 83% of theoretical memory bandwidth. via DeepSeek GitHub · via TechNode
  • xAI rolls out Grok 3 voice mode to Premium+ / SuperGrok. X Premium+ ($40/mo) and SuperGrok ($30/mo) subscribers got broad voice access — Unhinged, Romantic, Storyteller, Therapist, Professor, Meditation modes, Ara and Grok voices. Musk described the rollout as “a little patchy” and warned of bugs. via Social Media Today

Who shipped

The deepest single-day product shipping of the year so far. Anthropic shipped Claude 3.7 Sonnet + Claude Code + system card. DeepSeek shipped FlashMLA. xAI shipped a voice-mode tier expansion. OpenAI, Google DeepMind, Meta, and Mistral made no dated launches.

Open-source pulse

FlashMLA opens a five-day batched release week from DeepSeek. The Hopper-specific MLA kernel directly attacks one of the binding bottlenecks for production R1 / V3 serving: variable-length-sequence decoding throughput. Perplexity’s R1-1776, Stepfun’s Step-Video-T2V, and Nous DeepHermes-3 from the prior week continued spreading.

Money, infra & hardware

Claude 3.7 Sonnet’s launch with the same $3/$15 pricing as 3.5 Sonnet — including thinking tokens billed as output — set the pricing frame for hybrid reasoning. DeepSeek’s FlashMLA performance numbers (~83% of theoretical bandwidth) directly informed cost-of-inference debates for the rest of the week.

By the numbers

  • 70.3% — Claude 3.7 Sonnet SWE-bench Verified (state of the art at launch)
  • 200K / 64K / 128K — context / default output / extended-thinking budget ceiling
  • $3 / $15 per million tokens — input / output (same as 3.5 Sonnet)
  • 45% — drop in unnecessary refusals vs predecessor
  • 3,000 GB/s / 580 TFLOPS — FlashMLA H800 SXM5 memory-bound / compute-bound
  • Most-mentioned lab: Anthropic

Compiled by AI Feed’s editor from verified web sources for 24 February 2025.