VentureBeat, October 8, 02:38
AI Agent Reliability Breakthrough: Apollo-1 Powers Enterprise-Grade Task Execution
Conversational AI has long faced a reliability bottleneck when it comes to completing user instructions. Even advanced large language models (LLMs) fall far short of enterprise requirements in scenarios such as executing browser tasks. Apollo-1, a new foundation model from New York-based Augmented Intelligence (AUI), uses a hybrid architecture called "stateful neuro-symbolic reasoning" to significantly improve the reliability with which AI agents complete tasks. The model performs strongly on benchmarks such as airline booking, with pass rates far above existing models, promising enterprises trustworthy task execution and bridging the gap between AI that can talk and AI that can act.

🎯 **AI task-execution reliability remains unsolved**: Although large language models (LLMs) have made remarkable progress in dialogue and comprehension, their reliability at actually executing complex tasks, such as browser operations, still falls far short of enterprise expectations. On third-party benchmarks like Terminal-Bench Hard, even the best AI models reach only about 30% accuracy, while task-specific benchmarks such as TAU-Bench airline show pass rates of only 56%, indicating a high failure rate when AI executes instructions.

🚀 **Apollo-1: a breakthrough hybrid architecture**: AUI's Apollo-1 model adopts a hybrid architecture called "stateful neuro-symbolic reasoning," which combines the structural strengths of symbolic reasoning with the language fluency of neural networks, aiming to ensure consistent, policy-compliant outcomes in every customer interaction. It runs a closed reasoning loop: natural language is encoded into a symbolic state, a state machine maintains that state, a decision engine determines the next action, a planner executes it, and a decoder turns the result back into language, achieving deterministic task completion.

✅ **Deterministic behavior, not probabilistic prediction**: Unlike transformer models that generate output via next-token prediction, Apollo-1 predicts the next action in a conversation. A "System Prompt" defines a behavioral contract, ensuring the AI executes tasks strictly according to configured rules and policies, for example requiring identity verification for refunds above a certain amount, or always offering insurance before payment. This "behavioral certainty" is difficult for purely generative models to achieve and is critical for accuracy-sensitive industries such as finance and aviation.

📈 **Outstanding benchmark performance**: Apollo-1 demonstrates striking reliability gains across multiple benchmarks. On TAU-Bench Airline its pass rate reaches 92.5%, far above other models. In live flight-booking conversations, Apollo-1 completed 83% of tasks on Google Flights, far ahead of Gemini 2.5-Flash's 22%; in Amazon retail scenarios it completed 91%, far ahead of Rufus's 17%. These figures suggest an order-of-magnitude leap in task-completion reliability.

🤝 **A complement to AI, not a replacement**: AUI positions Apollo-1 as a necessary counterpart to large language models rather than a substitute. LLMs excel at open-ended dialogue and creative exploration, while Apollo-1 focuses on deterministic execution of task-oriented dialogue; together they span the full spectrum of conversational AI capability. Apollo-1 is already being piloted by Fortune 500 companies in finance, travel, retail, and other sectors, with APIs planned to open fully in November 2025.

For more than a decade, conversational AI has promised human-like assistants that can do more than chat. Yet even as large language models (LLMs) like ChatGPT, Gemini, and Claude learn to reason, explain, and code, one critical category of interaction remains largely unsolved — reliably completing tasks for people outside of chat.

Even the best AI models score only around 30% on Terminal-Bench Hard, a third-party benchmark designed to evaluate AI agents on completing a variety of browser-based tasks, far below the reliability most enterprises and users demand. Task-specific benchmarks fare little better: on TAU-Bench airline, which measures how reliably AI agents find and book flights on a user's behalf, the top-performing model (Claude 3.7 Sonnet) passes only 56% of the time — meaning the agent fails nearly half the time.

New York City-based Augmented Intelligence (AUI) Inc., co-founded by Ohad Elhelo and Ori Cohen, believes it has finally arrived at a solution that boosts AI agent reliability to a level where most enterprises can trust agents to do as instructed.

The company's new foundation model, Apollo-1, currently in preview with early testers but approaching general release, is built on a principle AUI calls stateful neuro-symbolic reasoning.

It's a hybrid architecture championed even by LLM skeptics such as Gary Marcus, designed to guarantee consistent, policy-compliant outcomes in every customer interaction.

“Conversational AI is essentially two halves,” said Elhelo in a recent interview with VentureBeat. “The first half — open-ended dialogue — is handled beautifully by LLMs. They’re designed for creative or exploratory use cases. The other half is task-oriented dialogue, where there’s always a specific goal behind the conversation. That half has remained unsolved because it requires certainty.”

AUI defines certainty as the difference between an agent that "probably" performs a task and one that "almost always" does.

For example, on TAU-Bench Airline, Apollo-1 achieves a staggering 92.5% pass rate, leaving all current competitors far behind, according to benchmarks shared with VentureBeat and posted on AUI's website.

Elhelo offered simple examples: a bank that must enforce ID verification for refunds over $200, or an airline that must always offer a business-class upgrade before economy.

“Those aren’t preferences,” he said. “They’re requirements. And no purely generative approach can deliver that kind of behavioral certainty.”

AUI's work on improving reliability was previously covered by subscription news outlet The Information but has not received widespread coverage in publicly accessible media — until now.

From Pattern Matching to Predictable Action

The team argues that transformer models, by design, can’t meet that bar. Large language models generate plausible text, not guaranteed behavior. “When you tell an LLM to always offer insurance before payment, it might — usually,” Elhelo said. “Configure Apollo-1 with that rule, and it will — every time.”

That distinction, he said, stems from the architecture itself. Transformers predict the next token in a sequence. Apollo-1, by contrast, predicts the next action in a conversation, operating on what AUI calls a typed symbolic state.

Cohen explained the idea in more technical terms. “Neuro-symbolic means we’re merging the two dominant paradigms,” he said. “The symbolic layer gives you structure — it knows what an intent, an entity, and a parameter are — while the neural layer gives you language fluency. The neuro-symbolic reasoner sits between them. It’s a different kind of brain for dialogue.”

Where transformers treat every output as text generation, Apollo-1 runs a closed reasoning loop: an encoder translates natural language into a symbolic state, a state machine maintains that state, a decision engine determines the next action, a planner executes it, and a decoder turns the result back into language. “The process is iterative,” Cohen said. “It loops until the task is done. That’s how you get determinism instead of probability.”

A Foundation Model for Task Execution

Unlike traditional chatbots or bespoke automation systems, Apollo-1 is meant to serve as a foundation model for task-oriented dialogue — a single, domain-agnostic system that can be configured for banking, travel, retail, or insurance through what AUI calls a System Prompt.

“The System Prompt isn’t a configuration file,” Elhelo said. “It’s a behavioral contract. You define exactly how your agent must behave in situations of interest, and Apollo-1 guarantees those behaviors will execute.”

Organizations can use the prompt to encode symbolic slots — intents, parameters, and policies — as well as tool boundaries and state-dependent rules.

A food delivery app, for example, might enforce “if allergy mentioned, always inform the restaurant,” while a telecom provider might define “after three failed payment attempts, suspend service.” In both cases, the behavior executes deterministically, not statistically.
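
State-dependent rules of this kind can be sketched as declarative guards evaluated on every state transition rather than left to token probabilities. The rule format below is an assumption for illustration, not AUI's actual System Prompt syntax:

```python
# Illustrative sketch: policy rules as deterministic guards checked on every
# state transition. The rule format is an assumption, not AUI's System Prompt.

RULES = [
    # (condition on state, required action)
    (lambda s: s.get("allergy_mentioned"), "inform_restaurant"),
    (lambda s: s.get("failed_payments", 0) >= 3, "suspend_service"),
]

def required_actions(state: dict) -> list[str]:
    """Return every action the behavioral contract demands for this state."""
    return [action for condition, action in RULES if condition(state)]

# Each guard fires 100% of the time its condition holds, by construction:
assert required_actions({"allergy_mentioned": True}) == ["inform_restaurant"]
assert required_actions({"failed_payments": 3}) == ["suspend_service"]
assert required_actions({"failed_payments": 1}) == []
```

The design point is that the rules live outside the language model entirely: compliance is checked against explicit state, so a rule either fires or doesn't, statistically independent of how the conversation was phrased.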

Eight Years in the Making

AUI’s path to Apollo-1 began in 2017, when the team started encoding millions of real task-oriented conversations handled by a 60,000-person human agent workforce.

That work led to a symbolic language capable of separating procedural knowledge — steps, constraints, and flows — from descriptive knowledge like entities and attributes.

“The insight was that task-oriented dialogue has universal procedural patterns,” said Elhelo. “Food delivery, claims processing, and order management all share similar structures. Once you model that explicitly, you can compute over it deterministically.”

From there, the company built the neuro-symbolic reasoner — a system that uses the symbolic state to decide what happens next rather than guessing through token prediction.

Benchmarks suggest the architecture makes a measurable difference.

In AUI’s own evaluations, Apollo-1 achieved over 90 percent task completion on the τ-Bench-Airline benchmark, compared with 60 percent for Claude-4.

It completed 83 percent of live booking chats on Google Flights versus 22 percent for Gemini 2.5-Flash, and 91 percent of retail scenarios on Amazon versus 17 percent for Rufus.

“These aren’t incremental improvements,” said Cohen. “They’re order-of-magnitude reliability differences.”

A Complement, Not a Competitor

AUI isn’t pitching Apollo-1 as a replacement for large language models, but as their necessary counterpart. In Elhelo’s words: “Transformers optimize for creative probability. Apollo-1 optimizes for behavioral certainty. Together, they form the complete spectrum of conversational AI.”

The model is already running in limited pilots with undisclosed Fortune 500 companies across sectors including finance, travel, and retail.

AUI has also confirmed a strategic partnership with Google and plans for general availability in November 2025, when it will open APIs, release full documentation, and add voice and image capabilities. Interested customers and partners can sign up for updates via a form on AUI's website.

Until then, the company is keeping details under wraps. When asked about what comes next, Elhelo smiled. “Let’s just say we’re preparing an announcement,” he said. “Soon.”

Toward Conversations That Act

For all its technical sophistication, Apollo-1’s pitch is simple: make AI that businesses can trust to act — not just talk. “We’re on a mission to democratize access to AI that works,” Cohen said near the end of the interview.

Whether Apollo-1 becomes the new standard for task-oriented dialogue remains to be seen. But if AUI’s architecture performs as promised, the long-standing divide between chatbots that sound human and agents that reliably do human work may finally start to close.
