AI assistants' ability to run autonomously improves significantly

Anthropic said that its new Claude Sonnet 4.5 model was able to run autonomously for 30 hours, maintaining sustained focus with minimal oversight while building an entire software application. It’s a significant improvement over the company’s previous Opus 4 model, released four months ago, which could operate autonomously for only seven hours.

Anthropic said Claude Sonnet 4.5 also outperformed Opus on key benchmarks and was more effective in meeting customers’ practical business needs. The company said the model was even better at coding than previous frontier models, and state-of-the-art on SWE-Bench Verified, a key benchmark that tests how models perform at software development tasks. Anthropic said that Claude Sonnet 4.5 was better than its predecessors at following instructions, identifying code improvements, and generating more production-ready code. When tested on tasks from the financial services industry, the company said the new model outperformed earlier Claude models in tasks such as researching, building financial models, and forecasting.

Anthropic appears to be pushing further ahead of its competitors in coding assistance and autonomous task completion, positioning its models toward corporate and workplace use. The company’s previous Claude Opus 4.1 model already bested competitors on OpenAI’s new benchmark of professional task completion, GDPval, which tested how models performed compared to human professionals across a range of industries and jobs.

Last week, OpenAI said its GPT-5 model and Anthropic’s Claude Opus 4.1 were “already approaching the quality of work produced by industry experts.”

Dueling usage studies released earlier this month also suggested that Anthropic’s Claude models were emerging as more professionally oriented AI models, especially in comparison to OpenAI’s ChatGPT, which is increasingly being used as a consumer product.

According to the study, most Claude users were turning to the models for workplace or productivity tasks. Mathematical tasks and coding were cited as the dominant activities globally for Claude.ai, making up 36% of all use cases.

Business use of Claude leaned heavily toward task automation. According to the study, approximately 77% of prompts that the model receives through its API—the application programming interface that is primarily used by enterprise customers—involve users requesting the system to perform tasks on their behalf, rather than just providing advice or suggestions. These business-focused interactions are also concentrated in coding, which accounts for 44% of API use. A further 5% of API usage was dedicated to developing or evaluating AI systems.

The tasks that business users automate also tend to be the most expensive ones to run. The findings indicate a shift in how businesses approach these tools. Rather than using them mainly for decision support or research, many teams are relying on them to take work off their plates entirely.

If models like Claude become more capable of autonomous work, especially in complex, time-intensive domains like software engineering, the implications for businesses and employees could be significant. Autonomous agents can reduce the need for constant human oversight and lower costs on repetitive workflows, speeding up a company’s operations and potentially reducing the headcount it needs.

