AI and Formal Methods: Progress, Challenges, and Future Directions

 

This post reviews recent developments at the intersection of AI and formal methods, focusing on Theorem's work combining formal methods with property-based testing and on the APE project's attempt at automated proof engineering. It discusses the challenges language models face with version control in math libraries such as Lean's Mathlib and with complex proof tasks, and introduces how the Cobblestone project uses LLMs for divide-and-conquer proof synthesis. It also covers the AI security community's need for formal methods and the influx of VC money into the field. Finally, it looks ahead to the role of AI and formal methods in future AI security and software quality assurance, and considers a rethink of the newsletter itself.

💡 **Theorem's combination of proofs and PBT**: Theorem is showing impressive technical strength in formal methods, particularly in combining property-based testing (PBT) with formal proof. The post notes that PBT is favored because it is computationally cheap and delivers roughly 80% of the value, but it can leave testing gaps; formal proofs close those gaps with modular, decomposed reasoning steps, at a computational cost. Theorem's work aims to fuse the strengths of both, buying more assurance at a better price.

🚀 **Automated Proof Engineering (APE) and Cobblestone**: The post introduces APE (Automated Proof Engineering), which aims to use large language models (LLMs) to automate engineering tasks on real-world math libraries, such as feature addition, refactoring, and bug fixing. APE-Bench I, the first benchmark built from the real-world commit history of Mathlib4, evaluates LLMs on these tasks. Cobblestone, meanwhile, proposes a divide-and-conquer approach to proof synthesis: an LLM generates candidate proofs, the problem is broken into parts that are verified step by step, and the final proof is guaranteed to be sound even when the LLM itself is unreliable, at low cost and with notable efficiency.

🌐 **AI security and formal methods**: The post argues that although formal methods are effective at improving software quality, their cost and expertise requirements limit adoption. VC investment in formal methods (especially Lean) is growing, but that investment does not automatically translate into AI security gains. The author stresses the need to direct formal verification toward real-world AI security and safety problems, and calls on the community to apply these techniques to actual AI safety challenges so that AI systems are reliable and secure.

🔄 **Version control and LLMs in math libraries**: For fast-moving math libraries such as Lean's Mathlib, rapid iteration and version control are a key pain point. The post considers the limits of language models in this high-velocity, low-pretraining-coverage setting. Despite recent progress, LLMs still struggle with complex, engineering-heavy libraries like Mathlib, especially with multi-file coordination, project-scale verification, and autonomous agents.

Published on October 22, 2025 2:30 AM GMT

Yall, I really do apologize for radio silence. It has mostly to do with breaking my ankle in three places, but I’m walking again.

This edition of the newsletter looks a bit more like movement happenings and announcements, which isn’t to say that there weren’t more papers or technical results I could’ve included, just that my mind wasn’t on them over the summer. I feel like I should be working on strategic clarity right now! Watch this space, etc.

Theorem blog post

I remain turbo impressed with Theorem’s tech. Formal methods is quality assurance, and as every QA technician learns on the first day of QA School, property-based tests are 80% of the value of a proof for 20% of the effort.

Why combine proofs and PBTs? Standard PBTs are compute-efficient but prone to gaps: developers miss edge cases, and rare bugs slip into production. Proofs close those gaps with modular steps that decompose the claim and make the reasoning explicit, revealing which cases we’re missing, but they’re computationally impractical.
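To make the tradeoff concrete, here’s a minimal sketch at the Lean prompt (my illustration, not Theorem’s; it assumes a recent Mathlib where the random-testing tactic is called `plausible`, formerly `slim_check`). The first claim is deliberately wrong, which is exactly the kind of thing PBT catches cheaply; the second is the kind of blanket guarantee only a proof gives you.

```lean
import Mathlib.Tactic

-- Property-based testing: throw random inputs at a claim and hunt for a
-- counterexample. The claim below is deliberately false, and `plausible`
-- reports a small counterexample (e.g. xs = [0], ys = [1]) almost instantly.
example (xs ys : List Nat) :
    (xs ++ ys).reverse = xs.reverse ++ ys.reverse := by
  plausible

-- A proof covers every input, not just the sampled ones, at the cost of
-- actually doing the reasoning (here `simp` finds `List.reverse_append`).
example (xs ys : List Nat) :
    (xs ++ ys).reverse = ys.reverse ++ xs.reverse := by
  simp
```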

They’re a little coy with the details. Too coy, really; I thought about omitting this from the newsletter, but I’ve seen some demos from the team that aren’t public yet and that make me bullish on the company, so I want to remind everyone to watch this space.

APE TOGETHER STRONG

No reason this couldn’t have been covered in an earlier episode! I think it slipped through the cracks or I had too many similar papers that month.

I’m pretty interested in the version control problem, in the long game. Every Lean synthesis laborer in the prompt mines, wiring up their agents, MCP servers, and lean-toolchain files every day, knows that the high code velocity of Lean and Mathlib, relative to how few of their tokens show up in pretraining data, is a major pain point, especially compared to the synthesis successes in better-resourced languages. What does it look like if language model technology eventually scales to something less hobbled by this?1
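For readers who haven’t been in the prompt mines: the wiring-up mostly amounts to pinning everything so the model’s output gets checked against a fixed API surface. A sketch of the project-side half (package name and Mathlib revision purely illustrative):

```lean
-- lakefile.lean (sketch). The sibling `lean-toolchain` file must name the
-- toolchain this Mathlib revision expects, e.g. `leanprover/lean4:v4.11.0`.
import Lake
open Lake DSL

package «agent_playground»

-- Pin Mathlib to one revision so agent-generated proofs keep compiling even
-- as upstream moves; bumping this pin is where the version-control pain lives.
require mathlib from git
  "https://github.com/leanprover-community/mathlib4" @ "v4.11.0"

@[default_target]
lean_lib «AgentPlayground»
```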

To be clear, this question of version control is different from the interest in version control over in Safeguarded/ARIA land that was covered in a previous newsletter. Over there (davidadia? davidadistan?), specifications and world models are a kind of structured data, and they want version control that (among other things) doesn’t just flatten critical specs to plaintext.

Moreover, I don’t think this is necessarily the reason ByteDance got interested in version control for proof synthesis. They may be thinking that a diff-aware / git-history-aware approach to Mathlib leads to more synthetic data and a better understanding (in the model) of how to add features, refactor, etc.

Recent progress in large language models (LLMs) has shown promise in formal theorem proving, yet existing benchmarks remain limited to isolated, static proof tasks, failing to capture the iterative, engineering-intensive workflows of real-world formal mathematics libraries. Motivated by analogous advances in software engineering, we introduce the paradigm of Automated Proof Engineering (APE), which aims to automate proof engineering tasks such as feature addition, proof refactoring, and bug fixing using LLMs. To facilitate research in this direction, we present APE-Bench I, the first realistic benchmark built from real-world commit histories of Mathlib4, featuring diverse file-level tasks described in natural language and verified via a hybrid approach combining the Lean compiler and LLM-as-a-Judge. We further develop Eleanstic, a scalable parallel verification infrastructure optimized for proof checking across multiple versions of Mathlib. Empirical results on state-of-the-art LLMs demonstrate strong performance on localized edits but substantial degradation on handling complex proof engineering. This work lays the foundation for developing agentic workflows in proof engineering, with future benchmarks targeting multi-file coordination, project-scale verification, and autonomous agents capable of planning, editing, and repairing formal libraries.
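To make “proof engineering task” concrete, here’s a hedged sketch of the kind of file-level edit involved; the lemma and instruction are my own toy example, not an actual APE-Bench item. The natural-language instruction asks for a refactor, the model produces the edited Lean file, and the result is checked by compiling it against the right Mathlib version (plus an LLM judge for the parts a compiler can’t see, like whether the edit matches the instruction).

```lean
import Mathlib.Tactic

/- Hypothetical task instruction (illustrative, not from APE-Bench I):
   "Generalize `add_self_eq_two_mul` from `Nat` to any commutative semiring
   and update the proof accordingly."

   Before:
     theorem add_self_eq_two_mul (n : Nat) : n + n = 2 * n := by omega
-/

-- After: the kind of file-level edit the model is asked to produce.
theorem add_self_eq_two_mul {R : Type} [CommSemiring R] (a : R) :
    a + a = 2 * a := by
  ring
```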

Cobblestone

This was originally posted a whole year ago, but had a revision over the summer. In any case, I had missed it until now.

I basically consider this a neurosymbolic architecture. The hammer (an SMT-driven proof search tool) is the symbolic part, and the regular LLM parts you know and love are the neuro part. Obviously you can do more than prompt the agent with “try using the hammer tactic sometimes if you get stuck, little buddy :) I believe in you”. Their divide-and-conquer is one of those things.

Standard (by now) CoT tricks and agent stuff. It’s further ahead than a couple papers I’ve seen in the Lean agent space that came out since then, so good for them.

Formal verification using proof assistants, such as Coq, is an effective way of improving software quality, but requires significant effort and expertise. Machine learning can automatically synthesize proofs, but such tools are able to prove only a fraction of desired software properties. We introduce Cobblestone, a divide-and-conquer approach for proof synthesis. Cobblestone uses a large language model (LLM) to generate potential proofs, uses those proofs to break the problem into simpler parts, automatically identifies which of those parts were successfully proven, and iterates on the remaining parts to build a correct proof that is guaranteed to be sound, despite the reliance on unsound LLMs. We evaluate Cobblestone on four benchmarks of open-source Coq projects, controlling for training data leakage. Fully automatically, Cobblestone outperforms state-of-the-art non-LLM tools, and proves many theorems that other LLM-based tools cannot, and on many benchmarks, outperforms them. Each Cobblestone run costs only $1.25 and takes 14.7 minutes, on average. Cobblestone can also be used with external input, from a user or another tool, providing a proof structure or relevant lemmas. Evaluated with such an oracle, Cobblestone proves up to 58% of theorems. Overall, our research shows that tools can make use of partial progress and external input to more effectively automate formal verification.
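Cobblestone itself targets Coq, but the divide-and-conquer shape is easy to picture. A minimal Lean-flavored sketch (my illustration, not the tool’s output): the candidate proof is split into independently checkable subgoals, whatever the kernel accepts is kept as partial progress, and only the failing pieces go back to the LLM (or a hammer, or the user) for another round.

```lean
import Mathlib.Tactic

-- Toy stand-in for a theorem the tool is asked to prove.
theorem binomial_square (a b : Nat) :
    (a + b) * (a + b) = a * a + 2 * a * b + b * b := by
  -- The LLM's candidate proof is decomposed into `have` steps, each checked
  -- independently by the kernel.
  have expand : (a + b) * (a + b) = a * a + (a * b + b * a) + b * b := by ring
  -- If the justification for this middle step had failed, only *it* would be
  -- retried; `expand` is kept as verified partial progress.
  have middle : a * b + b * a = 2 * a * b := by ring
  -- Final assembly. Soundness comes from the proof checker, not the LLM.
  rw [expand, middle]
```

The point of the structure is the one the abstract makes: even when the model’s overall attempt is garbage, the verified pieces are never thrown away.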

More math startups, but I’m not gonna single anyone out

As always, I’m out in the Applebee’s parking lot informing everyone that the VC influx into formal methods (especially Lean) doesn’t turn into AI security gains by default. You and I need to make sure that formal verification agents are directed at problems in real-world AI security and safety. Some or perhaps most math companies are doing PR in 2025 so they can create software QA products in 2026. I talked to one, though, that is adamant it is not: they just want to solve math problems and aren’t terribly fussed about capturing more than 1% of the upside. I think these guys are the outlier; most can be expected to pivot to program synthesis in 2026. I’m happy to discuss operationalizations of this to take bets. See also this brief memo I just jotted down2.

I feel like if you have friends who want to keep their ear to the ground about this math automation stuff, you should send this to them; that way I can corrupt them into AI security/safety slowly over time.


SMBC recently hit the nail on the nose. It reads like what would have come out if I had ranted at Zach Weinersmith himself, except he went way harder and took it way further.

I know a bunch of founders who are aiming to increase headcount soon

And while they have not literally posted their job ads yet, reach out and I can let you know if I think they’d like to talk to you.

I’d love to do some napkin math about the VC influx in FMxAI over the last year, and include projections for how it might grow in the next couple years. I’d also love for a reader to do it and I can just link to it here!

Clark Barrett’s appearance on Orpheus’ seminar series is finally scheduled

Separately, Orpheus has a Manifund post up to support the seminars and his other GSAI movement building efforts. Early endorsements by both of the FVAPPS coauthors.

TA2 is dead over at ARIA

I think we’ve finally had a newsletter published while ARIA does not have an active Safeguarded AI funding call. Achievement unlocked! The original TA2 plan to fund one team to the tune of 8 figures was replaced, rumor has it, in the 8th inning with a plan to fund several teams to the tune of 7 figures each. I can’t find anything written down publicly about this, but it doesn’t feel like a “rumor I’m not supposed to leak” situation. Anyway, in the 9th inning they just canceled TA2 altogether.

When we designed this programme, the world looked different. Today, the pace of progress in frontier AI models has fundamentally altered the path to our goals. We now expect that the intended technical objectives of TA2 will be attainable as a side effect of this progress, without requiring a dedicated R&D organisation. Instead of investing in creating specialised AI systems that can use our tools, it will be more catalytic to broaden the scope and power of the TA1 toolkit itself, making it a foundational component for the next generation of AI.

TA2, as you’ll recall, was to be a large investment in founding a new org that would be kinda frontier-lab-like in some ways, but specialized in being superpowered at the new specification / world model / proof certificate language/stack that TA1 is shipping. I think it’s roughly true that advanced capability in this new language/stack can be accomplished as a side effect of what Claude, GPT, Grok, and Gemini are going to do anyway. But I’m also surprised that the extent to which that is now forecast to be true, by davidad and his team, wasn’t priced in back when the Plan™ was originally drawn up. Davidad just seems a little too bitter-lesson/scaling-pilled to be pivoting out of surprise at capabilities alone! I guess I have a mild impulse to speculate that the team got spooked by the difficulty of aligning incentives for this kind of org to get a 7- or 8-figure injection at its founding, but again, I’d ultimately expect that to have been priced in when the Plan™ was originally published.

I’ve heard of at least one turbo sick venture that was germinated by the prospect of pursuing this grant, and I don’t think they’re giving up just yet. Watch this space.

Lisa from SL5 Taskforce still up at Manifund

Lisa from the SL5 Taskforce has a few more months to disburse up to $100k. The SL5 Taskforce is interested in some applications of FMxAI, especially to cloud infrastructure hardening. The center of the Venn diagram (between Progress in GSAI readers and the interests of the SL5 Taskforce) is not massive, but I think it’s big enough for us to discuss here!

Apply here; I think you also have to file a Manifund project.

Cool NASA slidedeck from the summer about formal methods at Langley

I think the “future” section, on cyberphysical systems and on comparing AI to cyberphysical systems, is good. The AI4FV section is all the stuff we talk about, and the FV4AI section is all the stuff I’m constantly saying we’re not talking about. The highlights, IMO, are the earlier parts, the historical context.

Rebranding the newsletter

Yall, I’m interested in your thoughts here. My priorities are shaped a lot by “ensure AI security knows what to ask the formal methods community for at crunch time” and the converse, “ensure that formal methodsititians view the AI security community as their most important customer”. I think GSAI is not completely wrong, for the title of the newsletter! But I think it could be even less wrong. The main problem I see with “guarantee” is that it doesn’t evoke swiss cheese, and I think formal methods are a source of swiss cheese!

I also have to disambiguate AI security as I mean it (boxing) from what e.g. Palisade means (offensive infosec capabilities). Watch this space etc.

Also if you want to help me with the newsletter we could be a team! I don’t think I could get back up to once per month without some friends to help.

There are no benefits to being a paid subscriber. I’m also frequently late as heck to post, as evidenced by this issue being a 4-month roundup. Still, the support is encouraging and meaningfully keeps me doing this. Anyway, a free subscription just puts each issue in your inbox right when it comes out.

1

Caveat: this problem was way worse 1-2 years ago. As the Lean 3 -> Lean 4 migration cooled down, the velocity apparent in pretraining data got lower, and language models also got much more powerful. Overall I’m curious whether this velocity/version control problem will persist; I’m not turbo confident it will.

2

I think this shortform is pretty important. I do regret being unnuanced about Curry-Howard in the past; separately, my worldview has gotten more nuanced and changed in other ways. Or: back when I said “something something curryhoward” more often, being directionally correct was enough, but now that more people are saying it, it’s time to be not just directionally correct but precise. You get the idea.



