Recursal AI development blog, September 25
Featherless.ai Action-R1 model improves reliability

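The Featherless.ai Action-R1 model achieves a 65% success rate on the REAL benchmark, ahead of Anthropic's 42%. The model focuses on practical tasks such as booking flights, handling email, and shopping, and improves reliability by working against simulated real-world website interfaces. Built in partnership with UI-licious, it draws on their UI testing technology and large-scale dataset to produce an AI agent that understands instructions and navigates complex interfaces. The agent can also discover and report bugs automatically, which helps improve overall automation efficiency.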

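🔍 The Featherless.ai Action-R1 model achieves a 65% success rate on the REAL benchmark, well ahead of Anthropic's 42%, and reaches a 100% success rate on the Amazon clone site.

🛠️ Built in partnership with UI-licious and drawing on their UI testing technology and large-scale dataset, the AI agent understands instructions and navigates complex interfaces, addressing the tendency of traditional RPA tools to break.

🐛 The AI agent not only completes tasks but also discovers and reports bugs automatically, further improving automation efficiency, reducing manual intervention, and lowering maintenance costs for businesses.

🤝 Featherless.ai plans to work with industry partners to build reliable, industry-specific AI agents for fields such as accounting and law.

🚀 Improving AI reliability would unlock the 90% of AI projects that never reach production, making AI more practical for day-to-day tasks and speeding its adoption in the enterprise market.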

Most companies and use cases do not care about PhD-level capabilities. They just want an AI model to depend on with absolute reliability, for the simple tasks in life.

Imagine a highly competent, dependable assistant who does all your chores, instead of the caffeine-overloaded genius who sometimes works when it “feels like it” but otherwise causes more harm than good half the time.

That's what we just built with the Featherless.ai Action-R1 model & agent, which achieved SotA (State-of-the-Art) in the REAL (Realistic Evaluations for Agents Leaderboard) benchmark.

Figure: REAL benchmark scores from realevals.xyz

Our AI agent (test result at link) achieved a 65% success rate on the REAL benchmark. That is 23 percentage points, or more than 50% in relative terms, above Anthropic computer use (result link), the next best model & framework by a major lab, at 42%.
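For readers who want to check the arithmetic, the gap between the two scores can be expressed both in percentage points and as a relative improvement. The sketch below uses only the 65% and 42% figures quoted above; the function name `compare_rates` is ours, chosen for illustration.

```python
def compare_rates(ours: float, baseline: float) -> tuple[float, float]:
    """Return (percentage-point gap, relative improvement in percent)."""
    point_gap = ours - baseline
    relative_gain = (ours - baseline) / baseline * 100
    return point_gap, relative_gain


# Scores quoted in the post: 65% for our agent, 42% for the baseline.
point_gap, relative_gain = compare_rates(65.0, 42.0)
print(f"{point_gap:.0f} percentage points, {relative_gain:.1f}% relative")
# prints "23 percentage points, 54.8% relative"
```

The two numbers answer different questions: percentage points measure the absolute gap in tasks completed, while the relative figure measures how much larger one score is than the other.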


What is the REAL benchmark?

What is interesting about the REAL benchmark is that it tests over 110 practical real-world tasks. These are not university-level knowledge tasks, but chores that reflect what people do online every day, like booking flights, organizing and replying to emails, and shopping for groceries. Basically, office desk work.

The tasks run against controlled replicas of 11 major websites, including Airbnb, Amazon, Gmail, LinkedIn, and Uber.

The paper for the benchmark can be found here.


Why does reliability matter?

Figure: side-by-side reliability comparisons for top use cases

Frontier AI agents have been plateauing at around 43% overall task completion. We hit 65%, leapfrogging it by 22 percentage points.

More importantly, on certain sites, such as Omnizon (the Amazon clone), our agent reached a 100% success rate, while the next best managed 60%.

And the 99%+ distinction matters.

Because you can now rely on this model and agent for tasks in this domain.

If a model only succeeds 50% of the time at a category of task, you spend more time and energy “babysitting” it. In several cases, the model will require human intervention, which can take more bandwidth than the task itself.
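The babysitting argument can be made concrete with a rough cost model. This is a minimal sketch under our own assumptions, not a measurement: we assume every agent result is reviewed by a human, that a failed run costs extra time to untangle and redo, and all the minute values below are hypothetical placeholders.

```python
def expected_human_minutes(success_rate: float,
                           review_minutes: float,
                           fix_minutes: float) -> float:
    """Expected human time per delegated task.

    Every result is reviewed; a failed run additionally costs fix_minutes.
    """
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    return review_minutes + (1.0 - success_rate) * fix_minutes


# Hypothetical numbers, chosen only for illustration.
MANUAL_MINUTES = 10.0   # doing the task by hand
REVIEW_MINUTES = 2.0    # checking the agent's result
FIX_MINUTES = 15.0      # untangling and redoing a failed run

for rate in (0.50, 0.99):
    cost = expected_human_minutes(rate, REVIEW_MINUTES, FIX_MINUTES)
    print(f"success rate {rate:.0%}: {cost:.2f} human minutes per task "
          f"(manual: {MANUAL_MINUTES:.2f})")
```

With these made-up inputs, the 50% agent saves almost nothing over doing the task by hand (9.5 versus 10 minutes), while the 99% agent cuts human time to roughly 2 minutes per task. Different inputs shift the break-even point, but the shape of the argument stays the same.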

Raising reliability means that you can hand off tasks to the AI, eliminating the frustration many workers face with AI in the enterprise.

For the first time, we have built an AI agent you can rely on for tasks within a handful of platforms with 99%+ success. We will eventually expand to 99%+ reliability for all web platforms.


How we did it

We partnered with the team at UI-licious, who automate end-to-end testing for their customers with AI at scale. We built on their experience in UI test automation and on the proprietary petabyte-scale dataset they have built up over years of UI testing.

Together, we co-built a specialized AI action model and agent harness. This helps the AI understand instructions and how to navigate complex and dynamic UIs.

As a bonus, these AI agents are not just navigating and completing tasks. They are also finding bugs and issuing bug reports, which we will be following up on with the eval maintainers 😉

Hopefully these will be passed to another AI to fix the bugs!

The bigger picture: attacking multi-billion dollar markets

The $3B+ Robotic Process Automation (RPA) market and the $20B+ UI testing market are dominated by brittle tools that break easily and require constant maintenance. In these markets, AI agents achieve about 60% reliability.

Today’s RPA tools like UiPath operate like glorified macros, breaking when user interfaces change or when variability is introduced. Businesses spend millions maintaining these systems and fixing broken RPA scripts.

Agentic AI is different. It learns. It adapts. It interacts with interfaces the way humans do: observing, interpreting, and acting based on context. It allows automation over dynamic interfaces.

And we’re just getting started.

As we scale our AI research in the open source space, we will be working alongside industry partners to build reliable AI agents and models for their own industry-specific use, with their own proprietary datasets.

So we will be increasing reliability, not just in automated UI testing, but in all future domains as well, with industry partners from accounting to legal.

Raising reliability means unlocking the 90% of AI projects that fail to enter production within enterprises, as the AI was just “not reliable enough” for day-to-day office tasks.

Raising reliability means bringing your AI agent into production and into the market.

If you are interested in partnering with us and making your AI more SotA-reliable, please reach out to us.


How can we try this new action model & agent?

Look out for a follow-up launch announcement with our design partners at UI-licious.

This AI action model and agent will be co-launched on both platforms.

For early access, you can sign up here:
https://forms.gle/wxwQ2z12xf1KzFvPA

Priority access will be given to portfolio companies among our investors and users.
