MarkTechPost@AI · September 3
Building an AI Agent with Short-Term and Long-Term Memory

This tutorial shows how to build an advanced AI agent that not only converses but also remembers what was said. Starting from scratch, it combines a lightweight language model, FAISS vector search, and a summarization mechanism to implement both short-term and long-term memory. Using embeddings and automatically distilled facts, the agent adapts to instructions, recalls important details in future conversations, and intelligently compresses context to keep interactions smooth and efficient.

💡 **Core technology stack**: The agent is built by combining a lightweight language model (TinyLlama), the FAISS vector search library, and a sentence-embedding model (all-MiniLM-L6-v2). Together, these let the agent understand and generate text while storing and retrieving information efficiently.

🗄️ **Vectorized memory (long-term)**: The `VectorMemory` class turns important user preferences, facts, and task information into vector embeddings stored in a FAISS index. The agent can then run semantic search against the user's input to quickly recall relevant history, giving it "long-term memory". Memories are persisted to a JSON file so they survive across sessions.

📝 **Context management (short-term)**: The agent manages short-term memory by keeping only a limited number of conversation turns (`max_turns`). When the turn count exceeds this threshold, the language model summarizes recent dialogue into `self.summary` and the oldest turns are discarded, keeping the context compact and efficient.

🧠 **Interaction design**: The heart of the agent is its `ask` method, which first tries to distill "durable" information from the user's input for long-term storage, then builds a prompt from recalled long-term memories and the recent-conversation summary, and finally generates a reply. This layered handling of memory and context yields more personalized, coherent interactions.

In this tutorial, we walk you through building an advanced AI agent that not only chats but also remembers. We start from scratch and demonstrate how to combine a lightweight LLM, FAISS vector search, and a summarization mechanism to create both short-term and long-term memory. By combining embeddings with auto-distilled facts, we craft an agent that adapts to our instructions, recalls important details in future conversations, and intelligently compresses context, keeping the interaction smooth and efficient. Check out the FULL CODES here.

!pip -q install transformers accelerate bitsandbytes sentence-transformers faiss-cpu

import os, json, time, uuid, math, re
from datetime import datetime

import torch, faiss
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig
from sentence_transformers import SentenceTransformer

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

We begin by installing the essential libraries and importing all the required modules for our agent. We set up the environment to determine whether we are using a GPU or a CPU, allowing us to run the model efficiently. Check out the FULL CODES here.

def load_llm(model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
    try:
        if DEVICE == "cuda":
            # 4-bit NF4 quantization keeps the 1.1B model well within GPU memory
            bnb = BitsAndBytesConfig(load_in_4bit=True,
                                     bnb_4bit_compute_dtype=torch.bfloat16,
                                     bnb_4bit_quant_type="nf4")
            tok = AutoTokenizer.from_pretrained(model_name, use_fast=True)
            mdl = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb, device_map="auto")
        else:
            # CPU fallback: full-precision weights with reduced host memory usage
            tok = AutoTokenizer.from_pretrained(model_name, use_fast=True)
            mdl = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32, low_cpu_mem_usage=True)
        return pipeline("text-generation", model=mdl, tokenizer=tok,
                        device=0 if DEVICE == "cuda" else -1, do_sample=True)
    except Exception as e:
        raise RuntimeError(f"Failed to load LLM: {e}")

We define a function to load our language model. We set it up so that if a GPU is available, we use 4-bit quantization for efficiency; otherwise, we fall back to the CPU with optimized settings. This ensures we can generate text smoothly regardless of the hardware we are running on. Check out the FULL CODES here.
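As a quick sanity check (not part of the original walkthrough), you could call the loader once and push a short prompt through the returned pipeline; this is a minimal sketch assuming `load_llm()` and `DEVICE` are defined exactly as above.

```python
# Minimal sketch: verify the pipeline loads and generates text.
llm = load_llm()  # downloads TinyLlama/TinyLlama-1.1B-Chat-v1.0 on first run
out = llm("Q: What is FAISS used for?\nA:", max_new_tokens=40,
          pad_token_id=llm.tokenizer.eos_token_id)[0]["generated_text"]
print(out)
```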

class VectorMemory:
    def __init__(self, path="/content/agent_memory.json", dim=384):
        self.path = path; self.dim = dim; self.items = []
        self.embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device=DEVICE)
        self.index = faiss.IndexFlatIP(dim)
        if os.path.exists(path):
            data = json.load(open(path))
            self.items = data.get("items", [])
            if self.items:
                X = torch.tensor([x["emb"] for x in self.items], dtype=torch.float32).numpy()
                self.index.add(X)

    def _emb(self, text):
        v = self.embedder.encode([text], normalize_embeddings=True)[0]
        return v.tolist()

    def add(self, text, meta=None):
        e = self._emb(text); self.index.add(torch.tensor([e]).numpy())
        rec = {"id": str(uuid.uuid4()), "text": text, "meta": meta or {}, "emb": e}
        self.items.append(rec); self._save(); return rec["id"]

    def search(self, query, k=5, thresh=0.25):
        if len(self.items) == 0: return []
        q = self.embedder.encode([query], normalize_embeddings=True)
        D, I = self.index.search(q, min(k, len(self.items)))
        out = []
        for d, i in zip(D[0], I[0]):
            if i == -1: continue
            if d >= thresh: out.append((d, self.items[i]))
        return out

    def _save(self):
        slim = [{k: v for k, v in it.items()} for it in self.items]
        json.dump({"items": slim}, open(self.path, "w"), indent=2)

We create a VectorMemory class that gives our agent long-term memory. We store past interactions as embeddings using MiniLM and index them with FAISS, allowing us to search and recall relevant information later. Each memory is saved to disk, enabling the agent to retain its memory across sessions. Check out the FULL CODES here.
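To see the long-term store in isolation, here is a small, hedged usage sketch; the file path and the stored sentences are illustrative, not taken from the article.

```python
# Minimal sketch: exercise VectorMemory on its own (illustrative texts, hypothetical path).
mem = VectorMemory(path="/content/demo_memory.json")
mem.add("User prefers concise answers.", {"source": "demo"})
mem.add("User's exam year is 2027.", {"source": "demo"})

for score, item in mem.search("When is the exam?", k=2):
    print(f"{score:.2f}  {item['text']}")
# Scores are inner products of normalized embeddings (cosine similarity):
# higher means a closer semantic match.
```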

def now_iso(): return datetime.now().isoformat(timespec="seconds")
def clamp(txt, n=1600): return txt if len(txt) <= n else txt[:n] + " …"
def strip_json(s):
    m = re.search(r"\{.*\}", s, flags=re.S)
    return m.group(0) if m else None

SYS_GUIDE = ("You are a helpful, concise assistant with memory. Use provided MEMORY when relevant. "
             "Prefer facts from MEMORY over guesses. Answer directly; keep code blocks tight. If unsure, say so.")

SUMMARIZE_PROMPT = lambda convo: f"Summarize the conversation below in 4-6 bullet points focusing on stable facts and tasks:\n\n{convo}\n\nSummary:"

DISTILL_PROMPT = lambda user: (
    f"""Decide if the USER text contains durable info worth long-term memory (preferences, identity, projects, deadlines, facts).
Return compact JSON only: {{"save": true/false, "memory": "one-sentence memory"}}.
USER: {user}""")

class MemoryAgent:
    def __init__(self):
        self.llm = load_llm()
        self.mem = VectorMemory()
        self.turns = []      # short-term memory: recent (role, text) pairs
        self.summary = ""    # rolling summary of older turns
        self.max_turns = 10

    def _gen(self, prompt, max_new_tokens=256, temp=0.7):
        out = self.llm(prompt, max_new_tokens=max_new_tokens, temperature=temp, top_p=0.95,
                       num_return_sequences=1, pad_token_id=self.llm.tokenizer.eos_token_id)[0]["generated_text"]
        return out[len(prompt):].strip() if out.startswith(prompt) else out.strip()

    def _chat_prompt(self, user, memory_context):
        convo = "\n".join([f"{r.upper()}: {t}" for r, t in self.turns[-8:]])
        sys = f"System: {SYS_GUIDE}\nTime: {now_iso()}\n\n"
        mem = f"MEMORY (relevant excerpts):\n{memory_context}\n\n" if memory_context else ""
        summ = f"CONTEXT SUMMARY:\n{self.summary}\n\n" if self.summary else ""
        return sys + mem + summ + convo + f"\nUSER: {user}\nASSISTANT:"

    def _distill_and_store(self, user):
        try:
            raw = self._gen(DISTILL_PROMPT(user), max_new_tokens=120, temp=0.1)
            js = strip_json(raw)
            if js:
                obj = json.loads(js)
                if obj.get("save") and obj.get("memory"):
                    self.mem.add(obj["memory"], {"ts": now_iso(), "source": "distilled"})
                    return True, obj["memory"]
        except Exception:
            pass
        # Heuristic fallback: keyword cues that usually indicate durable facts
        if re.search(r"\b(my name is|call me|I like|deadline|due|email|phone|working on|prefer|timezone|birthday|goal|exam)\b", user, flags=re.I):
            m = f"User said: {clamp(user, 120)}"
            self.mem.add(m, {"ts": now_iso(), "source": "heuristic"})
            return True, m
        return False, ""

    def _maybe_summarize(self):
        if len(self.turns) > self.max_turns:
            convo = "\n".join([f"{r}: {t}" for r, t in self.turns])
            s = self._gen(SUMMARIZE_PROMPT(clamp(convo, 3500)), max_new_tokens=180, temp=0.2)
            self.summary = s
            self.turns = self.turns[-4:]

    def recall(self, query, k=5):
        hits = self.mem.search(query, k=k)
        return "\n".join([f"- ({d:.2f}) {h['text']} [meta={h['meta']}]" for d, h in hits])

    def ask(self, user):
        self.turns.append(("user", user))
        saved, memline = self._distill_and_store(user)
        mem_ctx = self.recall(user, k=6)
        prompt = self._chat_prompt(user, mem_ctx)
        reply = self._gen(prompt)
        self.turns.append(("assistant", reply))
        self._maybe_summarize()
        status = f" memory_saved: {saved}; " + (f"note: {memline}" if saved else "note: -")
        print(f"\nUSER: {user}\nASSISTANT: {reply}\n{status}")
        return reply

We bring everything together into the MemoryAgent class. We design the agent to generate responses with context, distill important facts into long-term memory, and periodically summarize conversations to manage short-term context. With this setup, we create an assistant that remembers, recalls, and adapts to our interactions with it. Check out the FULL CODES here.
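For intuition about the distillation step: the model is asked to emit compact JSON that `strip_json()` can extract and `json.loads()` can parse. The output below is an illustrative example, not captured from an actual run.

```python
# Illustrative only: the exact wording depends on what the model generates.
raw = 'Sure! {"save": true, "memory": "User is called Nik and is preparing for UPSC in 2027."}'
print(json.loads(strip_json(raw)))
# -> {'save': True, 'memory': 'User is called Nik and is preparing for UPSC in 2027.'}
```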

agent = MemoryAgent()
print(" Agent ready. Try these:\n")
agent.ask("Hi! My name is Nicolaus, I prefer being called Nik. I'm preparing for UPSC in 2027.")
agent.ask("Also, I work at Visa in analytics and love concise answers.")
agent.ask("What's my exam year and how should you address me next time?")
agent.ask("Reminder: I like agentic RAG tutorials with single-file Colab code.")
agent.ask("Given my prefs, suggest a study focus for this week in one paragraph.")

We instantiate our MemoryAgent and immediately exercise it with a few messages to seed long-term memories and verify recall. We confirm it remembers our preferred name and exam year, adapts replies to our concise style, and uses past preferences (agentic RAG, single-file Colab) to tailor study guidance in the present.
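To confirm that the long-term store actually persisted, you could inspect the JSON file and query recall directly; this is a minimal sketch assuming the default path configured in VectorMemory above.

```python
# Minimal sketch: inspect what was stored and query it without going through ask().
stored = json.load(open("/content/agent_memory.json"))["items"]
for it in stored:
    print(it["meta"].get("source"), "->", it["text"])

# Direct recall returns the formatted memory excerpts used to build prompts.
print(agent.recall("What should you call the user?"))
```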

In conclusion, we see how powerful it is when we give our AI Agent the ability to remember. We now have an agent that stores key details, recalls them when relevant, and summarizes conversations to stay efficient. This approach keeps our interactions contextual and evolving, making the agent feel more personal and intelligent with each exchange. With this foundation, we are ready to extend memory further, explore richer schemas, and experiment with more advanced memory-augmented agent designs.


Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post How to Build an Advanced AI Agent with Summarized Short-Term and Vector-Based Long-Term Memory appeared first on MarkTechPost.
