Claude Sonnet 4 Hits 72.7% on SWE-Bench, Rivals GPT-4.1 and Gemini 2.5 Pro

Anthropic unveiled its latest AI model, Claude Sonnet 4, alongside the more powerful Claude Opus 4. The new Sonnet 4 model promises significant improvements in coding, reasoning, and task handling while maintaining the same pricing as its predecessor.

The upgraded Sonnet 4 is more than a minor update. It scores 72.7% on SWE-bench Verified, a benchmark of real-world software engineering tasks. That puts it roughly on par with the premium Opus 4 model in coding ability, despite being the more affordable option.

“Claude Sonnet 4 and Opus 4 transform AI from a tool into a true collaborator for every person and every team,” says Kate Jensen, head of Growth and Revenue at Anthropic. “Our customers will see project timelines shrink—in many cases from weeks to hours.”

Key Improvements

Sonnet 4 brings several major upgrades that change how people can work with AI:

Extended thinking with tool use allows Sonnet 4 to pause its reasoning, check information through tools like web search, and then continue where it left off. This helps the AI tackle more complex problems that require research.

The model follows instructions more precisely and shows significantly better memory when working with files. It can remember important information across longer conversations, helping it maintain context during extended tasks.

Sonnet 4 is available to free users as well as on the Pro, Max, Team, and Enterprise plans. API pricing stays at $3 per million input tokens and $15 per million output tokens, compared with $15 and $75 for Opus 4.
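To make the price gap concrete, here is a minimal sketch that applies the per-million-token rates quoted above to a workload. The model labels (`sonnet-4`, `opus-4`) and the token counts are illustrative assumptions, not official API identifiers or measured usage.

```python
# Prices in USD per million tokens, as quoted in the article.
# The dictionary keys are illustrative labels, not official model IDs.
PRICES = {
    "sonnet-4": {"input": 3.00, "output": 15.00},
    "opus-4": {"input": 15.00, "output": 75.00},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for the given token counts."""
    rates = PRICES[model]
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / 1_000_000


# Hypothetical workload: 2M input tokens and 500K output tokens.
sonnet_cost = estimate_cost("sonnet-4", 2_000_000, 500_000)
opus_cost = estimate_cost("opus-4", 2_000_000, 500_000)

print(f"Sonnet 4: ${sonnet_cost:.2f}")
print(f"Opus 4:   ${opus_cost:.2f}")
print(f"Opus 4 costs {opus_cost / sonnet_cost:.1f}x as much for this workload")
```

Because both the input and output rates differ by the same factor, the ratio between the two models comes out the same whatever the mix of input and output tokens.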

Real-World Applications

Early adopters are already seeing benefits from Sonnet 4’s improvements. GitHub plans to use Sonnet 4 to power its coding assistant in GitHub Copilot. Other companies report significant gains in how the model handles complex tasks:

iGent, a software company, found that Sonnet 4 reduced navigation errors in codebases from 20% to nearly zero, making it much more reliable for developers.

Sourcegraph noted the model “stays on track longer, understands problems more deeply, and provides more elegant code.”

Augment Code, another early user, reported “higher success rates, more surgical code edits, and more careful work through complex tasks.”

Agentic Capabilities

One of the most significant advancements in Sonnet 4 is its ability to function more like an “agent” – working independently on tasks with less human guidance.

Both new Claude models are 65% less likely than Claude Sonnet 3.7 to use shortcuts or loopholes when completing agentic tasks. This improvement makes the models more trustworthy when handling complex assignments without constant supervision.

Scott White, Anthropic’s product lead, explained the practical benefit: “It’s like the kind of thing that is challenging that might represent 30% of your day, that isn’t necessarily fulfilling or professionally expanding you, but is necessary in the pursuit of being successful in your job.”


