Steampunk AI · September 30, 19:06
Progress in AI Planning and Performance Evaluation

This article covers several recent developments in AI: a Stanford team used PDDL training data to improve LLM planning ability; an OpenAI evaluation found top LLMs approaching human-expert performance on key tasks across 10 major industries; a Nature paper proposed an in-memory-computing attention mechanism to improve LLM efficiency; ChatGPT launched a Pulse feature that integrates personal calendar information; and Google released a time-series model capable of few-shot learning. Together, these results show AI advancing in logical reasoning, practical application, and efficiency.

🔍 Using PDDL (the Planning Domain Definition Language) as training data, the Stanford team improved an LLM's logical planning ability, showing that formal representations can make the handling of complex logical tasks more consistent.

💼 OpenAI's evaluation shows top LLMs performing close to human experts on key tasks across the 10 largest industries, notably in finance and healthcare; the roles covered represent roughly $3-4 trillion in annual remuneration, suggesting AI could displace human labour quickly.

⚡ The analog in-memory-computing attention mechanism proposed in the Nature paper reduces GPU memory operations via pre-charging and incremental updates, cutting compute latency by two to four orders of magnitude for a GPT-2-class model, with potentially large energy savings.

📱 ChatGPT's new Pulse feature integrates calendar information and offers activity suggestions; it may become a key battleground for personal data, posing a threat to Google and Apple.

📊 Google's new time-series foundation model can learn to predict data trends from just a few examples, overcoming the previous need for domain-specific adaptation and improving the accuracy of business forecasting.

Saturday Links: PDDL and Symbolic Planning, GDPval, and Grabbing Context

Planning in LLMs, efficient attention mechanisms and breakthroughs in time series models.

This week, I was at the excellent APIDays "No AI with no APIs" event in London. Thank you so much to the team for the very kind invitation. A link to my talk slides is here; a longer write-up on that is coming up soon. In a busy week, Exa releases an MCP server, Salesforce's MuleSoft acquisition pays off even more with an entry into the agent frameworks market, and the TikTok algorithm looks like it will be managed by Oracle.

On to the main eye-catching bits of news. This week, with a scientific/technical lean:

Teaching LLMs to Plan: Logical Chain-of-Thought Instruction Tuning for Symbolic Planning. In this paper on LLM reasoning, a team at Stanford successfully used training data representing planning tasks and solutions to train an LLM to improve its logical planning ability. The representation of planning challenges and solutions they used was a real blast from the past: PDDL, which was created in 1998. You can find the formal technical report here. The results clearly show that a formal representation can help an LLM gain greater consistency in handling complex logical challenges. My guess, though, is that very high accuracy on planning tasks will require adding an actual logical reasoning engine as a tool that the LLM can call. (A toy sketch of what a PDDL-style training pair might look like follows this list.)

GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks. In this paper, OpenAI researchers carried out an extensive evaluation of LLM performance on tasks directly relevant to job roles held by humans in key industries today. The report starts from the top 10 US industries by GDP contribution, identifies the key roles within each industry, and specifies the key tasks carried out by experts. One of the most eye-opening parts of the paper is the final annex, which lists the industries and roles along with their estimated total remuneration. Across this (very approximately) $3-4 trillion of annual remuneration, leading LLMs performed close to human-expert competence on many tasks. With human oversight, many of the tasks can be performed well (though in unsupervised mode there is also a higher risk of critical failures). The results are very impressive, and models are still improving. Not only that, this is raw model performance: in many industries, a multitude of startups are building scaffolding and support for these tasks that should improve outcomes further. The team also open-sourced a set of 220 golden task examples (see the loading sketch below). With results like this, it is hard to argue that AI solutions will not be eating into human labour budgets (and not just in software) quickly.

Analog in-memory computing attention mechanisms for fast, energy-efficient large language models. One of the challenges with the core attention mechanism in LLM transformers is that it requires constantly moving tokens in and out of GPU memory. In this paper, published in Nature, the authors describe a mechanism that makes it possible to prime attention memory and update it incrementally. In their experiments on an initial (GPT-2-class) model, the method achieves a two-to-four order of magnitude reduction in compute latency, and potentially in energy usage. The results need to be validated and tried at scale, but if realized, these techniques could lead to significant efficiency gains. (A toy sketch of the access pattern appears after this list.)

Introducing ChatGPT Pulse. Otherwise known as the next shot in the war for personal context. ChatGPT now has a new mobile feature that pops up suggestions and curated information for your activities that day. It can also connect to your calendar, which may be the real goal here. The service sounds useful (though I'd argue you might be better off not killing your morning vibe with more pop-ups). This poses a significant threat to Google and Apple, as it takes over another aspect of personal information, context, and screen time. A push to integrate your email and DMs might not be far down the line.

Time series foundation models can be few-shot learners. Rounding off the week with another scientific post. This week, Google released a model that is strong at predicting continuations of time series, an essential function in modern business. Previous techniques already made a breakthrough in that they produced models that needed no domain-specific adaptation to work credibly. However, in time series tasks accuracy is king, and in this new work the Google team shows that with just a few injected examples, models can improve significantly in performance. (A hedged sketch of the few-shot setup also follows below.)
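For readers who have never met PDDL, here is a minimal sketch of what a training pair in the paper's spirit might look like. The blocksworld domain and the prompt/completion framing are illustrative assumptions on my part, not taken from the Stanford paper.

```python
# Toy PDDL-style training pair (illustrative; not from the paper).
# The model sees a planning problem in PDDL and learns to emit the plan.
pddl_problem = """\
(define (problem stack-two)
  (:domain blocksworld)
  (:objects a b)
  (:init (clear a) (clear b) (ontable a) (ontable b) (handempty))
  (:goal (on a b)))
"""

plan_solution = """\
(pick-up a)
(stack a b)
"""

# A single supervised fine-tuning example in prompt/completion form.
training_example = {
    "prompt": "Solve this PDDL planning problem:\n" + pddl_problem,
    "completion": plan_solution,
}
print(training_example["prompt"])
```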
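If you want to poke at the open-sourced golden tasks yourself, something like the following should work. The Hugging Face dataset id, split name, and field names here are assumptions, so check the GDPval release notes for the actual location.

```python
# Hedged sketch: loading the 220 open-sourced golden tasks.
# The dataset id "openai/gdpval" and split are assumptions on my part.
from datasets import load_dataset

tasks = load_dataset("openai/gdpval", split="train")
print(f"{len(tasks)} tasks")
for task in list(tasks)[:3]:
    print(task)  # field names (sector, occupation, prompt, ...) will vary
```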
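To make the "prime once, update incrementally" idea concrete, here is a digital toy sketch of the access pattern in NumPy: the key/value cache stays resident, and each decode step writes one new row instead of re-staging the whole cache. This only illustrates the data movement the paper avoids; the actual work implements attention in analog, in-memory hardware.

```python
import numpy as np

d = 64  # head dimension (toy size)

# "Primed" attention memory: K and V stay resident across decode steps
# (in the paper they live in the analog array; here, persistent arrays).
K = np.empty((0, d))
V = np.empty((0, d))

def attend_incremental(q, k_new, v_new):
    """Append one token's key/value, then attend over the resident cache."""
    global K, V
    K = np.vstack([K, k_new])  # incremental update: a single new row written
    V = np.vstack([V, v_new])  # no full-cache shuttling in and out of memory
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()                # softmax over all cached positions
    return w @ V                # weighted sum of values

rng = np.random.default_rng(0)
for _ in range(5):  # five toy decode steps
    out = attend_incremental(
        rng.standard_normal(d),
        rng.standard_normal((1, d)),
        rng.standard_normal((1, d)),
    )
print(out.shape)  # (64,)
```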
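Finally, a hedged sketch of what few-shot time-series forecasting looks like at the interface level: a handful of related example series are injected into the model's context ahead of the target history. The context-building convention and the commented-out model call are assumptions; the released model's actual API may differ.

```python
import numpy as np

def build_fewshot_context(support_series, target_history):
    """Concatenate a few support example series and the target history,
    separated by NaN markers, mimicking in-context example injection."""
    sep = np.array([np.nan])
    parts = []
    for s in support_series:
        parts += [s, sep]      # each injected example, then a separator
    parts.append(target_history)
    return np.concatenate(parts)

# Three related example series, plus the series we actually want to forecast.
support = [np.sin(np.linspace(0, 4 * np.pi, 64)) + i for i in range(3)]
history = np.sin(np.linspace(0, 2 * np.pi, 32))

context = build_fewshot_context(support, history)
# forecast = model.forecast(context, horizon=16)  # hypothetical model call
print(context.shape)  # (227,) = 3 examples of 64 + 3 separators + 32 history
```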

Wishing you a great weekend.
