IEEE Spectrum, September 12
U.S. Defense Innovation Unit Develops AI System to Assist Military Planning

The U.S. Defense Innovation Unit (DIU) is leading an experimental project called Thunderforge to build a custom AI system composed of multiple digital "agents." These agents will critique war plans across different military domains, run analyses in parallel, and flag potential weaknesses that human planners may have overlooked. The project will initially roll out to commanders and planners at U.S. Indo-Pacific Command and European Command, and will ultimately query internal databases, run DOD-grade simulations, and integrate with existing software such as DARPA's SAFE-SiM modeling and simulation architecture to create large numbers of realistic, plausible military scenarios for study. Scale AI is steering the project, with Microsoft supplying large language model (LLM) technology and Anduril providing modeling support.

💡 Thunderforge, led by the U.S. Defense Innovation Unit (DIU), aims to develop a custom AI system of multiple digital agents that critique war plans across domains and run analyses in parallel to identify potential weaknesses human planners may have overlooked.

📊 The project will work with commanders and planners at U.S. Indo-Pacific Command (INDOPACOM) and European Command (EUCOM), and will ultimately query internal databases, run DOD-grade simulations, and integrate with existing software such as DARPA's SAFE-SiM modeling and simulation architecture to create large numbers of realistic, plausible military scenarios.

🤖 Scale AI is steering the project, with Microsoft supplying large language model (LLM) technology and Anduril providing modeling support. The agents will collaborate dynamically, fusing their separate analyses to give operational planners a more comprehensive view.

🛡️ The project aims to shift the operator's role from being "in the loop" (micromanaging a single process) to being "on the loop" (applying strategic judgment to the options the AI generates).

🔍 Although AI excels in structured, bounded, low-stakes domains with routine data, military operations are often messy, chaotic, and rare, and failure can be catastrophic, so AI's application to military planning warrants caution.



The Defense Innovation Unit (DIU), part of the U.S. Department of Defense (DOD), is leading an experimental project, Thunderforge, to build a custom agentic AI system with multiple digital “agents” critiquing war plans across different military domains, running parallel analyses, and flagging potential weaknesses neglected by human planners.

The U.S. Indo-Pacific Command (INDOPACOM) tested some of Thunderforge’s AI capabilities during a tabletop exercise in June. The project, first announced in March, remains in the early stages, with an initial rollout targeting commanders and planners across INDOPACOM and its sister command in Europe (EUCOM). Ultimately, the system will query internal databases, run DOD-grade simulations, and integrate with pre-existing software, such as DARPA’s SAFE-SiM modeling and simulation architecture, to create vast numbers of realistic and plausible military scenarios for study.

California-based Scale AI is steering the project, with Microsoft supplying its large language model (LLM) technology and Anduril providing modeling. Scale AI says its system is designed to coordinate multiple custom agents, each leveraging a range of models and acting as a digital staff officer to help synthesize data for mission-critical planning activities.

“These agents dynamically collaborate, fusing separate analyses into a more comprehensive view for operational planners to consider,” says Dan Tadross, the head of public sector at Scale AI. “This approach is designed to shift the role of the operator from being ‘in the loop’—micromanaging a single process—to being ‘on the loop,’ where they can apply their strategic judgment to the options generated.”

Enhancing AI in Military Planning

The DIU has split Thunderforge’s development into two tracks, starting with augmenting the cognitive plan-writing process through red-teaming. The system will submit a human-authored plan to a team of AI agents to offer perspectives across multiple domains—including logistics, intelligence, and cyber and information operations. “You can really customize it however you want,” says Bryce Goodman, the DIU’s chief strategist for AI.
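A minimal sketch of that red-teaming fan-out, under assumptions: the function and prompt names below (call_llm, DOMAIN_PROMPTS, red_team) are hypothetical stand-ins, since Thunderforge's actual interfaces are not public. The idea is simply that one human-authored plan is sent to several domain-specialist agents in parallel and their critiques are collected.

```python
# Hypothetical sketch of fanning a human-authored plan out to domain agents.
from concurrent.futures import ThreadPoolExecutor

DOMAIN_PROMPTS = {
    "logistics": "Critique this plan for supply, lift, and sustainment gaps.",
    "intelligence": "Critique this plan's assumptions about adversary posture.",
    "cyber": "Critique this plan's exposure to cyber and information operations.",
}

def call_llm(system_prompt: str, plan_text: str) -> str:
    """Placeholder for whatever LLM endpoint is actually used."""
    return f"[critique under prompt: {system_prompt[:40]}...]"

def red_team(plan_text: str) -> dict[str, str]:
    # Each domain agent reviews the same plan in parallel, mirroring the
    # 'team of AI agents' critique loop described above.
    with ThreadPoolExecutor() as pool:
        futures = {
            domain: pool.submit(call_llm, prompt, plan_text)
            for domain, prompt in DOMAIN_PROMPTS.items()
        }
    return {domain: f.result() for domain, f in futures.items()}

if __name__ == "__main__":
    critiques = red_team("Phase 1: move two brigades to the forward staging area...")
    for domain, critique in critiques.items():
        print(f"{domain}: {critique}")
```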

The second track will link with DOD’s most advanced modeling software to conduct modeling simulations, generate and analyze outputs, and interpret the results. Even for humans, building simulations and understanding the raw results requires extensive training in a specialized skillset. Goodman says, “Where that becomes really powerful is that now, the LLMs are calling on tools the DOD has built and validated, and suddenly you’re combining the brute pattern recognition that AI does with physics-based simulations or other logical reasoning tools that perform similar tasks. It’s scaffolding in all those capabilities.”
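The pattern Goodman describes is essentially tool calling: the LLM does not invent simulation outcomes, it requests runs from validated tooling and then interprets the returned numbers. Here is a hedged sketch of that dispatch loop; run_engagement_sim and the request format are invented for illustration and are not the interface of SAFE-SiM or any DOD system.

```python
# Sketch of an LLM requesting a run from a validated simulation tool.
import json

def run_engagement_sim(blue_units: int, red_units: int, terrain: str) -> dict:
    """Placeholder for a physics- or doctrine-based simulation call."""
    return {"blue_attrition": 0.12, "red_attrition": 0.31, "terrain": terrain}

TOOLS = {"run_engagement_sim": run_engagement_sim}

def handle_tool_request(request_json: str) -> str:
    """Dispatch a tool call emitted by the LLM and return the result as text."""
    request = json.loads(request_json)
    tool = TOOLS[request["name"]]
    result = tool(**request["arguments"])
    return json.dumps(result)

# The model would emit a structured request like this instead of guessing outcomes:
llm_request = '{"name": "run_engagement_sim", "arguments": {"blue_units": 40, "red_units": 55, "terrain": "littoral"}}'
print(handle_tool_request(llm_request))
```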

With track one underway, the DIU has already developed a minimum viable product, though it requires deeper integration with classified data systems and separate government partnerships. As the “plumbing” gets built, Goodman says, his team hopes to have the system validating existing scenarios by the end of this year, and then writing scenarios from scratch sometime in 2026.

Janus was a conflict simulation model developed by Lawrence Livermore National Laboratory in the 1970s. [Image credit: Lawrence Livermore National Laboratory]

AI’s Role in Conflict Simulation

Conflict simulation has long been a fixture of operational planning and training, with wargame participants mapping out battlefield movements on sand tables. In the 1970s, Lawrence Livermore National Laboratory introduced Janus, the world’s first player-interactive combat simulator operating in near real-time. It was a key planning tool for the United States’ 1989 invasion of Panama and, later, Operation Desert Storm.

Today’s agentic AI technology goes further with autonomous, multistream data processing, says political scientist Stephen Worman, director of the RAND Center for Gaming, which specializes in wargames and strategic simulations for the U.S. military. “These capabilities can improve situational awareness, speed threat modeling iterations, and streamline resupply or pre-deployment planning.”

“That said, the downsides warrant caution,” Worman adds, citing a Texas National Security Review paper written by Jon Lindsay, an associate professor at the Georgia Institute of Technology specializing in cybersecurity, privacy, and international affairs. “He underscores that AI excels in structured, bounded, low-stakes domains with routine data, but military operations are often messy, chaotic, and rare events,” says Worman. “There’s a premium on getting it right, and failure can be catastrophic.”

Without formal frameworks to interpret why AI agents make decisions or how much trust to place in their reasoning, the resulting opacity creates a false sense of precision, Worman says, adding, “An agent might perform plausibly in most situations, but it might simply be amplifying biases or exploiting flaws in the underlying model.”

Goodman admits the project faces significant research challenges. LLMs sometimes confidently hallucinate answers. An LLM might present a convincing plan that, on closer inspection, sends a warship plowing through Australia. Goodman notes that this kind of output seems thoughtful until you interrogate the underlying reasoning and find no real logical coherence.

“My baseline assumption is that LLMs are going to hallucinate and be flawed and opaque, and we’re not going to understand all their failure modes,” Goodman says. “That’s why it’s more about understanding the user’s context. If my goal in generating an output is to spur thought for me, but not get a finished product, then the hallucinations are less of a concern. That’s why we’re starting with critiquing human plans.”

Hallucinations can also be diminished by allowing an agent to call upon external tools. For example, instead of guessing how many tanks are deployable in a zone, the AI agent queries a database for the latest information.
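A minimal sketch of that grounding pattern follows; the schema, zone names, and counts are invented purely for illustration. The point is that the agent cites a value returned by an authoritative store rather than estimating one.

```python
# Rather than letting the model estimate deployable tanks, the agent queries a database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readiness (zone TEXT, tanks_deployable INTEGER)")
conn.executemany(
    "INSERT INTO readiness VALUES (?, ?)",
    [("zone_alpha", 62), ("zone_bravo", 17)],
)

def deployable_tanks(zone: str) -> int:
    """Tool exposed to the agent; returns the latest recorded count."""
    row = conn.execute(
        "SELECT tanks_deployable FROM readiness WHERE zone = ?", (zone,)
    ).fetchone()
    return row[0] if row else 0

# The agent cites this value in its plan instead of hallucinating one.
print(deployable_tanks("zone_alpha"))  # -> 62
```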

Scale AI cites additional agentic safeguards like tracing for true explainability, which allows an operator to see the specific chain of evidence and reasoning that led to a conclusion. “Continuous adversarial testing to actively probe for hidden biases, vulnerabilities, or potential for misalignment is a powerful tool before agentic deployment,” Tadross says. “Where applicable, formal verification methods provide a way to mathematically prove that an agent’s behavior will remain within pre-defined, acceptable bounds.”
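One way to picture the tracing safeguard Tadross describes is an append-only log of every retrieval, tool call, and conclusion an agent makes, so an operator can audit the chain of evidence behind a recommendation. The structure below is an assumption for illustration, not Scale AI's actual implementation.

```python
# Hypothetical audit trace: every step an agent takes is logged for review.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class TraceEvent:
    step: str    # e.g. "retrieval", "tool_call", "conclusion"
    detail: str  # what was asked, returned, or concluded
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

@dataclass
class Trace:
    events: list[TraceEvent] = field(default_factory=list)

    def log(self, step: str, detail: str) -> None:
        self.events.append(TraceEvent(step, detail))

    def report(self) -> str:
        return "\n".join(f"[{e.timestamp}] {e.step}: {e.detail}" for e in self.events)

trace = Trace()
trace.log("retrieval", "readiness DB: zone_alpha tanks_deployable=62")
trace.log("tool_call", "run_engagement_sim(blue=40, red=55) -> red_attrition=0.31")
trace.log("conclusion", "Recommend Course of Action B; logistics risk flagged")
print(trace.report())
```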

LLMs were presented with three escalation options in response to four action categories: attack, blockade, clash, and declare war. They responded with varying degrees of force, some more pacifist than others. [Image credit: Yasir Atalan]

Evaluating AI in Foreign Policy

Scale AI and the Center for Strategic and International Studies developed a Critical Foreign Policy Decisions Benchmark to understand how leading LLMs responded to 400 expert-crafted diplomatic and military scenarios. Models like Qwen2 72B, DeepSeek V3, and Llama 8B Instruct showed a tendency toward escalatory recommendations, while GPT-4o and Claude 3.5 Sonnet were significantly more restrained. The study also found varying degrees of country-specific bias across all models, often advising less interventionist or escalatory responses toward Russia and China than the United States or the United Kingdom.
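To make the comparison concrete, a benchmark like this can be summarized by mapping each model's scenario response onto an escalation scale and averaging per model. The sketch below is illustrative only; the scenario labels, scoring, and data are assumptions, not the CSIS dataset or methodology.

```python
# Rough sketch of aggregating escalation tendencies across models.
from statistics import mean

ESCALATION_SCORE = {"de-escalate": 0, "hold": 1, "escalate": 2}

responses = {
    "model_a": ["escalate", "escalate", "hold"],
    "model_b": ["hold", "de-escalate", "hold"],
}

for model, answers in responses.items():
    avg = mean(ESCALATION_SCORE[a] for a in answers)
    print(f"{model}: mean escalation score {avg:.2f}")
```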

“It depends which model you pick. Not all models are trained on the same data, and their model architectures aren’t always the same,” says Yasir Atalan, a data fellow at the Center for Strategic and International Studies who co-authored the benchmarking analysis. “You can train and fine-tune a model, but its baseline preconditions and inclinations can still be different from others. That’s where we need more experimentation. I think Thunderforge can give us a picture of whether these models are complementing, or to what extent they can achieve these goals or fail.”

A separate but similar line of inquiry comes from Jacquelyn Schneider, a Hoover Institution fellow and professor at the Naval Postgraduate School. Last year, Schneider and colleagues at Stanford University evaluated five off-the-shelf LLMs in military and diplomatic decision-making. All models exhibited hard-to-predict escalation patterns.

“I expected to see a lot more divergence between LLMs,” Schneider says. “For me, the tendency toward escalation is a puzzle. Is it because the corpus of knowledge focuses only on escalation? De-escalation is hard to study since it never happens.”

Schneider cautioned that even tightly scoped AI systems can have unintended strategic effects if commanders trust outputs without fully understanding how they were produced. She says it’s important to train users to build campaigns without the software, or to spot underlying problems that could hinder decision-making during combat.

“For the last 20 years, the U.S. has built campaign plans for the global War on Terror without our information structures constantly being targeted as they would in a scenario against a competent adversary. Some data might be manipulated on a day-to-day basis due to electromagnetic interference or cyberattacks, but the scope and scale increases exponentially in wartime, when the software becomes a target,” Schneider says. “How the system works in a peacetime environment represents the best possible situation, but how it degrades in a combat environment could be the standard of how closely it becomes a part of combat planning processes.”

Thunderforge’s creators and outside experts both emphasize the importance of human oversight. Tadross says Scale AI’s role is to provide the assets that military personnel need to deter conflicts or gain winning advantages if forced to fight. The DOD ultimately defines the doctrine, rules of engagement, and appropriate level of human oversight for any given mission.

“While the system can be tasked to generate and assess courses of action based on those defined parameters, the ultimate decision-making authority always rests with the human commander,” Tadross says. “Our responsibility is to ensure the technology provides clear, understandable, and trustworthy support that empowers their judgment and accelerates their decision-making at the speed of relevance.”
