VentureBeat
AI Agents: Core Components and How They Work

This article offers a deep dive into today's much-discussed AI agent technology, explaining its core components and how they operate. AI agents extend the capabilities of generative AI beyond the chat interface and into the real world, greatly amplifying the usefulness of AI models. The article covers the key building blocks: agent development frameworks, runtime environments, the translation mechanism between LLMs and tool calls, short-term and long-term memory systems, authorization mechanisms, and execution traces. Using the "ReAct" (reasoning + action) model as an example, it shows how an agent works toward a user's goal step by step through a tool-calling loop. It also discusses technologies such as Firecracker for deploying agents securely and efficiently, and the role of the MCP protocol in agent-tool communication. Ultimately, the power of AI agents comes from these components working in concert.

🤖 **The core idea and operating model of AI agents**: AI agents combine large language models (LLMs) with external tools to carry out tasks autonomously. The basic workflow: the user sets a goal, the LLM calls a tool to gather information or perform an action, and, based on the tool's response, the LLM may call further tools, looping until the goal is reached. This "tool-calling loop" is what lets AI agents accomplish complex tasks.

🛠️ **Key infrastructure for building and running AI agents**: Building an AI agent requires a set of core components: an easy-to-use agent development framework; a cloud environment for running agents securely and at scale (such as microVMs created with Firecracker); a mechanism for translating information between the LLM and tool calls; short-term memory for tracking the interaction so far; long-term memory for storing user preferences; an authorization system for handling permissions; and execution traces for evaluating agent performance.

💡 **The "ReAct" model and tool integration**: AI agents often adopt the "ReAct" (reasoning + action) model: the agent thinks before acting, performs the action, then observes the result and adjusts. Development frameworks let developers define goals in natural language along with the tools the agent may use, and agents can even generate new tools on the fly. This makes agents far more adaptable to a task's needs; for example, an agent can be instructed to generate Python code to process data.

🔒 **Security and memory management**: Running AI agents demands strict authorization, covering both the user's authorization of the agent and the agent's own authorization when it accesses networked resources, typically via protocols such as OAuth. Short-term memory stores key information from the current session and avoids redundant tool calls; long-term memory preserves user preferences and history across sessions for more personalized service. When a session ends, information in short-term memory is distilled and moved into long-term memory.

Agents are the trendiest topic in AI today — and with good reason. Taking gen AI out of the protected sandbox of the chat interface and allowing it to act directly on the world represents a leap forward in the power and utility of AI models.

The word “agent” has been used in different ways, however, and there have been some overheated claims about what agents can do. The rhetoric, the willful obfuscation and the rapid evolution of the field have left a lot of people confused.

To cut through the noise, I’d like to describe the core components of an agentic AI system and how they fit together: It’s really not as complicated as it may seem. Hopefully, when you’ve finished reading this post, agents won’t seem as mysterious.

Agentic ecosystem

Although definitions of the word "agent" abound, I like the British programmer Simon Willison's minimalist take:

An LLM agent runs tools in a loop to achieve a goal.

The user prompts a large language model (LLM) with a goal: say, booking a table at a restaurant near a specific theater. The LLM then calls a tool: say, a database of restaurant locations. The tool provides a response, which is passed back to the LLM, and the LLM calls a new tool. Through repeated tool calls, the agent moves toward accomplishing the goal.
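
To make that loop concrete, here is a minimal, self-contained Python sketch. The tool bodies and the `call_llm` stub are scripted stand-ins invented for this example, not any particular framework's API:

```python
# A minimal sketch of "runs tools in a loop to achieve a goal".
# Everything here is a stand-in, scripted so the example actually runs.

def find_restaurants(near: str) -> list[dict]:
    """Stand-in for a map/places tool."""
    return [{"name": "Luigi's", "cuisine": "pizza", "blocks_away": 1}]

def book_table(name: str, time: str) -> dict:
    """Stand-in for a reservation tool."""
    return {"status": "confirmed", "name": name, "time": time}

TOOLS = {"find_restaurants": find_restaurants, "book_table": book_table}

def call_llm(goal: str, history: list[dict]) -> dict:
    """Scripted stand-in for a model call: returns either a tool request
    or a final answer, based on what has happened so far."""
    if not history:
        return {"tool": "find_restaurants", "args": {"near": "the theater"}}
    if len(history) == 1:
        return {"tool": "book_table", "args": {"name": "Luigi's", "time": "19:00"}}
    return {"answer": "Booked Luigi's at 19:00."}

def run_agent(goal: str, max_steps: int = 10) -> str:
    history: list[dict] = []
    for _ in range(max_steps):                          # the loop
        step = call_llm(goal, history)
        if "answer" in step:                            # goal reached; stop
            return step["answer"]
        result = TOOLS[step["tool"]](**step["args"])    # execute the tool call
        history.append({"call": step, "result": result})  # feed the result back
    return "Step budget exhausted."

print(run_agent("Book a table near the theater."))
```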

But what kind of infrastructure does it take to realize this approach? An agentic system needs a few core components:

- an agent development framework;
- a runtime environment in which agents can execute securely and at scale;
- a mechanism for translating between the LLM and tool calls;
- short-term memory for tracking the content of the current interaction;
- long-term memory for persisting user information across sessions;
- an authorization system; and
- traces of agent execution, for performance evaluation.

Let's dive into more detail on each of these components.

Building an agent

Asking an LLM to explain how it plans to approach a particular task improves its performance on that task. This “chain-of-thought reasoning” is now ubiquitous in AI.

The analogue in agentic systems is the ReAct (reasoning + action) model, in which the agent has a thought (“I’ll use the map function to locate nearby restaurants”), performs an action (issuing an API call to the map function), then makes an observation (“There are two pizza places and one Indian restaurant within two blocks of the movie theater”).

Agents aren’t required to use the ReAct framework, but it has proven highly effective. Today, agents commonly loop over the thought-action-observation sequence.
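
As an illustration, one common way to realize this is to have the model emit "Thought:" and "Action:" lines and to append an "Observation:" after each tool run. The transcript below is hypothetical; the restaurant, theater and tool names are invented for the example:

```python
# Illustrative ReAct transcript (hypothetical wording), showing how each
# loop iteration interleaves reasoning, a tool call, and its result.
react_trace = """
Thought: I'll use the map function to locate restaurants near the theater.
Action: find_restaurants(near="Majestic Theater")
Observation: Two pizza places and one Indian restaurant within two blocks.
Thought: The user prefers Indian food; I'll try to book that one.
Action: book_table(name="Saffron House", time="19:00")
Observation: {"status": "confirmed", "time": "19:00"}
Thought: The goal is achieved; report the booking to the user.
"""
```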

With an agent development framework, the developer defines a goal using natural language, then specifies the tools that the agent can use to achieve that goal, such as databases and microservices. The tool specifications include a natural-language explanation of the context and purpose of the tool’s use, as well as a description of the syntax of the tool’s available API calls.
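
A tool specification might look like the following sketch, which pairs a natural-language account of the tool's purpose with the syntax of its call. The schema resembles common function-calling formats, but the field names here are illustrative rather than tied to any specific framework:

```python
# Illustrative tool specification: a natural-language description plus a
# machine-readable account of the call's parameters.
restaurant_tool_spec = {
    "name": "find_restaurants",
    "description": (
        "Searches a restaurant database by location. Use this when the "
        "user wants to eat near a specific place, such as a theater."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "near": {"type": "string", "description": "Street address or landmark"},
            "radius_miles": {"type": "number", "description": "Search radius"},
        },
        "required": ["near"],
    },
}
```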

The developer can also tell the agent to build its own tools on the fly. Let’s say, for instance, one tool retrieves a table stored as comma-separated text. To fulfill its goal, the agent needs to sort the table.

Sorting a table by repeatedly sending it through an LLM and evaluating the results would be a colossal waste of resources — and it’s not even guaranteed to give the right result. Instead, the developer can simply instruct the agent to generate its own Python code.
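
The agent-generated code might amount to only a few lines. Here is an illustrative sketch of the kind of program it could write for itself, sorting a comma-separated table by one column:

```python
# The kind of code an agent might generate for itself (illustrative):
# sort a CSV table deterministically instead of asking the LLM to
# "reason" the rows into order.
import csv
import io

def sort_csv(csv_text: str, column: str) -> str:
    """Sort a comma-separated table by one column; return it as CSV."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda r: r[column])
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

table = "name,blocks_away\nSaffron House,2\nLuigi's,1\n"
print(sort_csv(table, "blocks_away"))
```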

Tool use can divide responsibility between the LLM and the developer in different ways. Once the tools available to the agent have been specified, the developer can simply instruct the agent to use them when necessary. Or the developer can specify which tool to use for which types of data, and even which data items to use as arguments during function calls.

Similarly, the developer can tell the LLM to generate Python code for automating repetitive tasks when necessary or, alternatively, tell the agent which algorithms to use for which data types and even provide pseudocode. The approach can vary from agent to agent.

Runtime

Historically, there were two main ways to isolate code running on shared servers: containerization, which was efficient but offered lower security; and virtual machines, which were secure but came with a lot of computational overhead.

In 2018, Amazon Web Services’ (AWS’s) Lambda serverless-computing service deployed Firecracker, a new form of server isolation. Firecracker creates “microVMs” that have reduced functionality but commensurately low overhead, so that each function executed on a Lambda server can have its own microVM.

Because instantiating an agent requires deploying an LLM, together with the memory resources to track the LLM’s inputs and outputs, the per-function isolation model is impractical. Instead, session-based isolation can assign every agent its own Firecracker microVM, complete with computational capacity, memory and file system resources. When the session finishes, the LLM’s state information is copied to long-term memory, and the microVM is destroyed. This ensures secure and efficient deployment.
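
In pseudocode terms, the session lifecycle looks roughly like the sketch below. Every helper here is a hypothetical stand-in: a real deployment drives Firecracker through its REST API, not a Python interface like this.

```python
# Hedged sketch of session-based isolation; all helpers are stand-ins.

def microvm_create(cpus: int, mem_mb: int) -> dict:
    """Stand-in for provisioning a Firecracker microVM."""
    return {"cpus": cpus, "mem_mb": mem_mb, "running": True}

def microvm_destroy(vm: dict) -> None:
    """Stand-in for tearing the microVM down."""
    vm["running"] = False

def run_agent_session(goal: str, user_id: str, long_term: dict) -> None:
    vm = microvm_create(cpus=2, mem_mb=4096)       # one microVM per session
    try:
        session_state = {"goal": goal, "history": []}  # lives inside the VM
        # ... run the tool-calling loop here, appending to session_state ...
        long_term[user_id] = session_state         # copy state out before teardown
    finally:
        microvm_destroy(vm)  # context and short-term memory are destroyed with it
```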

Tool calls

Just as there are several existing development frameworks for agent creation, there are several existing standards for communication between agents and tools, the most popular of which — currently — is the model context protocol (MCP).

MCP establishes a one-to-one connection between the agent’s LLM and a dedicated MCP server that executes tool calls, and it also establishes a standard format for passing different types of data back and forth between the LLM and its server.
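
Here is a hedged sketch of the client side, based on the official MCP Python SDK at the time of writing; the server script and tool name are illustrative:

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Launch a local MCP server as a subprocess speaking stdio
    # (the server script name is illustrative).
    params = StdioServerParameters(command="python", args=["restaurant_server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()           # MCP handshake
            tools = await session.list_tools()   # discover the server's tools
            result = await session.call_tool(
                "find_restaurants", arguments={"near": "Majestic Theater"}
            )
            print(tools, result)

asyncio.run(main())
```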

Sometimes, however, the necessary tool is not one with an available API. In such cases, the only way to retrieve data or perform an action is through cursor movements and clicks on a website. There are a number of services available to perform such computer use and handle the translations back and forth between the computer use service and the LLM interface.

Authorizations

With agents, authorization works in two directions. First, of course, users require authorization to run the agents they’ve created. But as the agent is acting on the user’s behalf, it will usually require its own authorization to access networked resources.

There are a few different ways to approach the problem of authorization. One is with an access delegation algorithm like OAuth, which essentially plumbs the authorization process through the agentic system. The user enters login credentials into OAuth, and the agentic system uses OAuth to log into protected resources, but the agentic system never has direct access to the user’s passwords.
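
A minimal sketch of the token exchange at the heart of that flow, using the standard OAuth 2.0 authorization-code grant; the endpoint URL and client values are placeholders:

```python
import requests

def exchange_code_for_token(code: str) -> str:
    """Trade the authorization code from the user's login redirect for a
    scoped access token. The agent stores the token, never the password."""
    resp = requests.post(
        "https://auth.example.com/oauth/token",    # placeholder endpoint
        data={
            "grant_type": "authorization_code",
            "code": code,
            "redirect_uri": "https://agent.example.com/callback",
            "client_id": "agent-client-id",        # placeholder credentials
            "client_secret": "agent-client-secret",
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]             # sent as a Bearer token
```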

In the other approach, the user logs into a secure session on a server, and the server has its own login credentials on protected resources. Agentic platforms typically let the user select from a variety of authorization strategies and the algorithms for implementing them.

Memory and traces

Short-term memory

LLMs are next-word prediction machines. What makes them so astoundingly fluent is that their prediction of the next word in a sequence is based on the semantic embeddings of all the words they’ve already seen.

That's the LLM’s context, which is in itself a kind of memory. But it’s not the only kind of memory an agentic system needs.

Suppose, again, that an agent is trying to book a restaurant near a movie theater, and from a map tool, it’s retrieved a couple dozen restaurants within a mile radius. It doesn’t want to dump information about all those restaurants into the LLM’s context: All that extraneous information could wreak havoc with next-word probabilities.

Instead, it can store the complete list in short-term memory and — again, using embeddings — retrieve one or two records at a time, based on the user’s price and cuisine preferences and proximity to the theater. If none of those restaurants pans out, the agent can dip back into short-term memory, rather than having to execute another tool call.
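
A minimal sketch of that retrieval pattern, assuming some embedding function `embed` that maps text to a vector:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

class ShortTermMemory:
    """Holds full tool results for the session, surfacing only the records
    most relevant to the current query so the LLM context stays small."""

    def __init__(self, embed):
        self.embed = embed                 # any text -> vector function
        self.records: list[tuple[list[float], dict]] = []

    def add(self, text: str, record: dict) -> None:
        self.records.append((self.embed(text), record))

    def top_k(self, query: str, k: int = 2) -> list[dict]:
        q = self.embed(query)
        ranked = sorted(self.records, key=lambda r: cosine(q, r[0]), reverse=True)
        return [record for _, record in ranked[:k]]
```

The agent would `add` every restaurant the map tool returned, then `top_k` a query like "cheap pizza near the theater" to pull just the one or two relevant records into the LLM's context.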

Long-term memory

At the end of a session, the agent’s microVM — including the context and the contents of short-term memory — is destroyed. But we will often want to retain information about a particular user from one session to another. If you booked airline reservations today, you’ll want the agent to remember your destination and travel dates when you send it to book a hotel tomorrow.

At the end of every session, then, the context and the contents of short-term memory are extracted and distilled for storage in long-term memory. The distillation process can include summarization and embedding for later vector-based retrieval. It can also involve “chunking”, in which documents are split up into sections grouped according to topic. Each topic — or the text chunks dealing with a topic — is then embedded, to make subsequent vector-based retrieval easy. Users can select summarization, embedding, chunking, and other distillation strategies and algorithms.
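
A sketch of what such a distillation step might look like; `summarize` and `embed` are hypothetical stand-ins for model calls, and the blank-line chunking is deliberately naive:

```python
# Sketch of end-of-session distillation: split the transcript into chunks,
# then summarize and embed each one for later vector-based retrieval.

def distill_session(transcript: str, summarize, embed, long_term: list) -> None:
    # Naive chunking on blank lines; a real system would group by topic.
    chunks = [c.strip() for c in transcript.split("\n\n") if c.strip()]
    for chunk in chunks:
        long_term.append({
            "summary": summarize(chunk),  # condensed text
            "vector": embed(chunk),       # for vector-based retrieval
            "raw": chunk,
        })
```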

Finally, in addition to storing context and content, agent systems record the API calls and responses that correspond to the LLM’s inputs and outputs. These traces permit manual review of agents’ performance.
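
One simple way to capture such traces is sketched below, with an illustrative record schema: wrap each tool so every call and response lands in a reviewable log.

```python
# Sketch of execution tracing: wrap each tool so calls and responses
# are appended to a log for manual review (record schema is illustrative).
import functools
import json
import time

TRACE: list[dict] = []

def traced(tool):
    @functools.wraps(tool)
    def wrapper(**kwargs):
        result = tool(**kwargs)
        TRACE.append({
            "timestamp": time.time(),
            "tool": tool.__name__,
            "arguments": kwargs,
            "result": result,
        })
        return result
    return wrapper

@traced
def find_restaurants(near: str) -> list[dict]:
    return [{"name": "Luigi's", "blocks_away": 1}]  # stand-in tool body

find_restaurants(near="Majestic Theater")
print(json.dumps(TRACE, indent=2))  # dump the trace for review
```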

Conclusion

That’s it! Of course, it takes a lot of engineering to get all these components to work together and to run efficiently on specific hardware, but in broad strokes, that’s how agentic systems work.
