LLM Watch

Anthropic’s Claude Sonnet 4.5 and Artifacts: Toward an AI That Works, Not Just Chats

Daniel Brooks

By: Daniel Brooks

Tuesday, September 30, 2025

Sep 30, 2025

6 min read

Claude Sonnet 4.5 pairs long-running agentic performance with Artifacts, turning chat outputs into living files teams can edit, version, and ship together. Photo Credit: Anthropics

Key Takeaways

Claude Sonnet 4.5 extends multi-hour autonomy, leads on “computer use” benchmarks, and improves reasoning and math performance compared to earlier models.
It retains the same pricing as Claude Sonnet 4: $3 per million input tokens / $15 per million output tokens.
The model enables new product features: checkpoints in Claude Code, refreshed terminal UX, context editing, memory tools, file creation, and a VS Code extension.
Anthropic is packaging these capabilities into a workspace-like experience, Artifacts plus an Agent SDK, to make Claude not just a model but a platform for long-lived agentic work.

If the prior era was about “which model is the smartest,” the next battleground is “which model is most dependable, integrated, and durable.” With Claude Sonnet 4.5 and the enhancements around it, Anthropic is leaning into that shift.

What’s new and improved in Claude Sonnet 4.5

Sonnet 4.5 sustains 30+ hours of autonomous work and excels at computer-use tasks

A marquee claim in the Anthropic announcement is that Claude Sonnet 4.5 “maintains focus for more than 30 hours on complex, multi-step tasks.” [1] This is a departure from shorter-lived sessions in earlier models, and signals a push toward AI agents that can carry out extended workflows autonomously.

On real-world computer tasks, Claude 4.5 shows major gains: in OSWorld, a benchmark designed to test AI models executing computer operations (navigating UIs, using command-line tools, etc.), Sonnet 4.5 achieves 61.4 %, up from 42.2 % in Sonnet 4 just months prior. [1] This leap underscores the model’s improved “tool use” competence.

Claude 4.5 outperforms on SWE-bench, reasoning, and domain knowledge benchmarks

According to the announcement, Claude 4.5 leads on SWE-bench Verified, a benchmark for realistic software engineering tasks, using a two-tool scaffold (bash + file editing). [1] Their reported score is 77.2 % (averaged over 10 trials under a 200 K “thinking” budget). [1] Under a 1 M context setting, the model achieves 78.2 %. [1] With additional parallel compute, Anthropic reports a high-compute version hitting 82.0 %. [1]

Beyond coding, Anthropic reports improvements in math and general reasoning, stating that domain experts in finance, law, medicine, and STEM saw “dramatically better domain-specific knowledge and reasoning” compared to older Claude models. [1]

Pricing remains identical to Sonnet 4, making upgrades cost-free

Importantly, Sonnet 4.5 is offered under the same pricing as Sonnet 4: $3 per million input tokens, $15 per million output tokens. [1] Because of this, many users can upgrade to 4.5 with no cost penalty, reaping performance benefits without spending more.

Claude 4.5 is a drop-in replacement for existing use: it is available across Claude apps, Claude Code, and via the Claude API. [1]

Features & ecosystem advances bundled with Sonnet 4.5

The model release is paired with significant enhancements to the product ecosystem and tooling, reinforcing Anthropic’s view of Claude as an integrated workspace.

Ecosystem upgrades: checkpoints, memory tools, VS Code extension, and terminal refresh

In Claude Code, Sonnet 4.5 introduces checkpoints, allowing users to save intermediate progress and roll back changes instantly. [1]
In the Claude API, a context editing feature and an upgraded memory tool permit agents to run longer and tackle more complex problems by better handling context and state. [1]
The VS Code extension is refreshed, and the terminal interface is updated, aiming to improve developer UX. [1]

File creation in conversations & browser integration

Claude apps now support file creation (e.g. spreadsheets, slides, documents) mid-conversation, weaving document generation into the interaction flow. [1]
The Claude for Chrome extension is made available to Max users (those on the premium plan). [1] The extension can leverage the model’s upgraded “computer use” capabilities to operate in-browser. [1]

Developer empowerment through the Claude Agent SDK

A major new offering is the Claude Agent SDK, which provides the same underlying infrastructure that powers Claude Code’s agentic tasks. Developers can use it to build their own agents, benefiting from the advances in memory, context, and orchestration that underpin Sonnet 4.5. [1]

Anthropic emphasizes that they have solved challenging problems such as memory management across long tasks, permission systems balancing autonomy with control, and coordination among subagents. [1]

Stronger safety, alignment, and protections under ASL-3

Anthropic states that Sonnet 4.5 is their “most aligned frontier model yet,” with better behavior on metrics like deception, sycophancy, power-seeking, and delusional thinking. [1] They also report improved defenses against prompt injection attacks. [1]

Sonnet 4.5 is released under AI Safety Level 3 (ASL-3) protections, aligning with Anthropic’s framework for matching model capability with safeguards. [1] Users encountering interruptions due to filtering (e.g., false positives) may fallback to Claude Sonnet 4. [1]

From chat to workspace: the evolving role of Artifacts & agentic workflows

While the “Artifacts” concept was present in earlier Claude versions, the Sonnet 4.5 release strengthens its role as the persistent workspace where content, context, and interaction converge.

With file creation built into conversations, the boundary between chat and document/code output blurs: you can generate files in-context and continue working on them. [1]
Through memory tools and context editing, Claude can revisit, recall, and evolve artifacts across multiple sessions. [1]
Agentic workflows built on the Agent SDK can act on artifacts (e.g. code files, spreadsheets) as first-class objects, chaining logic and transformations over them. [1]
The persistent nature of artifacts enables project continuity: rather than outputs disappearing through chat history, they live in a workspace that supports rollback, versioning, and further edits.

In effect, Claude is shifting from “chat + attachments” toward a hybrid of conversational interface + lightweight integrated development environment (IDE) + workspace.

What this means for businesses, developers, and teams

Enterprise benefits: longer workflows, lower costs, and reliable task persistence

The jump in autonomy (30+ hours), better tool use, and maintained pricing mean Claude 4.5 can take on more ambitious, long-horizon tasks. Enterprises can rely on the model for sustained workflows without fear of context loss or collapse. The economic implications (getting more work per session) may be as meaningful as raw performance gains.

Developer advantages: faster iteration, long-lived agents, and better context handling

Faster iteration: You can generate code, test, edit, and re-run seamlessly via the integrated environment, less switching between chat and code editors.
Agent creation: The Agent SDK + improved runtime enables building more capable, longer-lived agents (e.g. for monitoring, automation, data pipelines).
Greater context control: Memory tools and context editing reduce the burden of prompt engineering and mitigate the brittleness of long chains.

Artifacts provides a shared, persistent canvas: marketing, legal, design, product, all can co-evolve shared documents, code, or visual assets. Because outputs persist and can be incrementally improved, collaboration becomes more transparent and versionable.

Claude Sonnet 4.5 signals that the AI frontier is shifting away from “which model scores highest” toward “which model endures, adapts, and integrates into real work.” Anthropic is betting that winning on durability, coherence, and ecosystem will matter more than marginal benchmark leads.

Why it matters: durable, integrated, cost-efficient AI for real work

Claude Sonnet 4.5 is one of the earliest systems designed to execute extended, agentic workflows reliably at scale. As organizations experiment with AI for long-running tasks (e.g. end-to-end data pipelines, autonomous monitoring, multi-step development), models must maintain memory, context, artifact continuity, and safety over many hours.

For any team evaluating AI adoption, the critical questions now include:

Can the model persist reliably over extended sessions?
Are artifacts and outputs first-class and editable, rather than buried in chat?
How well can the model integrate with external tools, software, and environments?
Does the pricing allow experimentation at scale, especially for longer tasks?
Are safety, alignment, and risk mitigations proportional to the model’s autonomy?

Claude Sonnet 4.5 is a bold statement: the future of AI isn’t just about reasoning strength, but sustained usefulness, and Claude is aiming to be that.