ThursdAI - Recaps of the most high signal AI weekly spaces
AI model progress at a glance

 

This week several companies shipped major model updates: DeepSeek's V3.1 hybrid reasoning model cuts thinking-token usage while holding quality, ByteDance open-sourced a 36B long-context model family with a "thinking budget" control, NVIDIA released a hybrid-architecture 9B model and published most of its pretraining dataset details, Cohere released a 111B multilingual reasoning model, and ComputerRL, a desktop-agent framework from Z.ai and Tsinghua University, reached a 48% success rate on OSWorld. In addition, IBM and NASA open-sourced the Surya model for solar-weather forecasting, Qwen released a fully open 20B image editor, and OpenAI introduced Agents.md to standardize agent tool configuration. Together these releases push forward reasoning efficiency, long-context handling, and agentic capability.

🔍 DeepSeek V3.1 is a hybrid reasoning model that combines V3's fast execution with R1's deep reasoning, hitting R1-level results on many tasks with fewer thinking tokens; its hybrid mode lets the model plan in thinking tokens and then switch to non-thinking execution, which simplifies building agent frameworks.

📈 ByteDance's Seed-OSS 36B model family is open-sourced under Apache-2.0, trained on 12T tokens with a 512K context window, and introduces an adjustable "thinking budget" control that lets users trade reasoning depth for speed, with 4/8-bit quantization supported.

📊 NVIDIA's Nemotron Nano 9B V2 uses a hybrid Mamba+Transformer architecture, claims up to 6x the throughput of a pure-Transformer peer, and handles 128K context on a single A10 after compression and distillation; most of its pretraining dataset details are public, but the license includes restrictive clauses.

🌐 Cohere's Command A Reasoning is a dense 111B model with 256K context that scores 70% on the BFCL leaderboard, above DeepSeek R1 and GPT-OSS; it ships under Cohere's non-commercial research license, targets teams that need private deployment, and supports 23 languages.

🖥️ ComputerRL, from Z.ai and Tsinghua University, unifies GUI actions with API calls and runs reinforcement learning on clusters of virtual desktops, reaching a 48.1% success rate on OSWorld and significantly advancing desktop automation; the training system launches thousands of qemu-in-docker VMs via gRPC and alternates reinforcement learning with supervised fine-tuning.

🌞 Surya, open-sourced by IBM and NASA, was trained on nine years of multi-instrument observations to forecast solar dynamics and space weather; the model is available on Hugging Face and shows the real-world scientific value of open models.

🖼️ Qwen's 20B image editor is fully open, supports high-level semantic edits in Chinese and English (pose adjustment, rotation, style/IP creation) and low-level detail edits (add/remove/insert), and can swap objects, adjust aspect ratios, keep character identity consistent, and transfer styles, all runnable on a local GPU.

📁 OpenAI's Agents.md standardizes configuration for agentic dev tools: a Markdown file at the repo root defines setup, build, test, and run rules; it is already supported by Codex, Amp, Jules, Cursor, RooCode, and other tools, and has been adopted by tens of thousands of public repos.

🔍 GPT-5 in the web UI has been truncating long prompts, suddenly forgetting the bottom half of a prompt even below the documented context limit; this appears to be an OpenAI bug rather than a deliberate nerf, and the API seems unaffected.

🌐 The "Nano Banana" model shows striking 3D consistency in image editing, compositing objects into scenes while preserving lighting and geometry, and can even generate positioned 3D meshes and render new viewpoints; it likely relies on some form of 3D world modeling under the hood and is currently only available on LMArena.

🚀 Weights & Biases is working to bring DeepSeek V3.1 to its inference service and released Catnip, a containerized multi-agent coding workspace that runs multiple isolated Claude Code (or other agent) sessions, each with its own context and notification stream, solving the problem of parallel agents interfering with each other.

Hey everyone, Alex here 👋

This week looked quiet… until about 15 hours before we went live. Then the floodgates opened: DeepSeek dropped a hybrid V3.1 that beats their own R1 with fewer thinking tokens, ByteDance quietly shipped a 36B Apache-2.0 long-context family with a “thinking budget” knob, NVIDIA pushed a faster mixed-architecture 9B with open training data, and a stealth image editor dubbed “Nano Banana” started doing mind-bending scene edits that feel like a new tier of 3D-aware control.

On the big-co side, a mystery “Sonic” model appeared in Cursor and Cline (spoiler: the function call paths say a lot), and OpenAI introduced Agents.md to stop the config-file explosion in agentic dev tools. We also got a new open desktop-agent RL framework that 4x’d OSWorld SOTA, an IBM + NASA model for solar weather, and Qwen’s fully open 20B image editor that’s shockingly capable and runnable on your own GPU.

Our show today was one of the shortest yet, as I had to drop early to prepare for Burning Man 🔥🕺. Speaking of which, Wolfram and the team will host the next episode!

Ok, let's dive in!

DeepSeek V3.1: a faster hybrid that thinks less, scores more (X, HF)

DeepSeek does this thing where they let a base artifact “leak” onto Hugging Face, and the rumor mill goes into overdrive. Then, hours before we went live, the full V3.1 model card and an instruct variant dropped. The headline: it’s a hybrid reasoner that combines the strengths of their V3 (fast, non-thinking) and R1 (deep, RL-trained thinking), and on many tasks it hits R1-level scores with fewer thinking tokens. In human terms: you get similar or better quality, faster.

A few things I want to call out from the release and early testing:

Hugging Face: https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Base

Quick personal note: I’m seeing a lot of small, pragmatic improvements add up here. If you’re building agents, the hybrid mode plus tighter tool integration is a gift. DeepSeek V3.1 is going to be deployed to the W&B Inference service soon! Take a look here to see when it’s ready: wandb.me/inference
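
To make the hybrid-mode point concrete, here’s a minimal sketch of the plan-then-execute pattern against a generic OpenAI-compatible endpoint. The base URL and the two model ids are placeholders for whatever names your provider exposes for thinking vs. non-thinking mode, not official DeepSeek identifiers.

```python
# Minimal sketch of the "plan with thinking, execute without" agent pattern.
# Assumes an OpenAI-compatible endpoint; the base_url and the two model ids
# ("deepseek-v3.1-think" / "deepseek-v3.1") are placeholders, not official names.
from openai import OpenAI

client = OpenAI(base_url="https://example.com/v1", api_key="YOUR_KEY")

def plan(task: str) -> str:
    """Use the thinking variant to produce a step-by-step plan."""
    resp = client.chat.completions.create(
        model="deepseek-v3.1-think",  # hypothetical id for the thinking mode
        messages=[{"role": "user", "content": f"Plan the steps to: {task}"}],
    )
    return resp.choices[0].message.content

def execute(step: str) -> str:
    """Use the non-thinking variant for fast, cheap step execution."""
    resp = client.chat.completions.create(
        model="deepseek-v3.1",  # hypothetical id for the non-thinking mode
        messages=[{"role": "user", "content": step}],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    steps = plan("refactor the logging module and add tests")
    for line in steps.splitlines():
        if line.strip():
            print(execute(line))
```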

ByteDance Seed-OSS 36B: Apache-2.0, 512K context, and a “thinking budget” knob (X, HF, Github)

I didn’t see much chatter about this one, which is a shame because this seems like a serious release. ByteDance’s Seed team open-sourced a trio of 36B dense models—two Base variants (with and without synthetic data) and an Instruct model—under Apache-2.0, trained on 12T tokens and built for long-context and agentic use. The context window is a native half-million tokens, and they include a “thinking budget” control you can set in 512-token increments so you can trade depth for speed.

They report strong general performance, long-context RULER scores, and solid code/math numbers for a sub-40B model, with the Instruct variant posting very competitive MMLU/MMLU-Pro and LiveCodeBench results. The architecture is a straightforward dense stack (not MoE), and the models ship with Transformers/vLLM support and 4/8-bit quantization ready to go. If you’ve been hunting for a commercial-friendly, long-context 30-something‑B with an explicit reasoning-control dial, this should be on your shortlist.
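
If you want to kick the tires locally, here’s a minimal sketch using standard 4-bit loading in Transformers. The repo id is my best reading of the Hugging Face release, and the `thinking_budget` template argument is an assumption about how the budget control is exposed, so double-check the model card before relying on either.

```python
# Sketch: load Seed-OSS-36B-Instruct in 4-bit and cap the reasoning depth.
# The thinking_budget kwarg reflects the release's 512-token budget increments;
# treat the exact argument name (and the repo id) as assumptions to verify.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "ByteDance-Seed/Seed-OSS-36B-Instruct"  # check the exact repo name on HF
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

messages = [{"role": "user", "content": "Summarize this long spec in five bullets."}]
inputs = tok.apply_chat_template(
    messages,
    add_generation_prompt=True,
    thinking_budget=512,  # assumed template variable; budgets move in 512-token steps
    return_tensors="pt",
).to(model.device)

out = model.generate(inputs, max_new_tokens=512)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```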

A neat detail for the training nerds: two Base releases—one trained with synthetic data, one without—make for a rare apples-to-apples study in how synthetic data shapes base capability. Also worth noting: they previously shipped a Seed-Prover specialized for Lean; it looks like the team is interested in tight domain models and generalists.

NVIDIA Nemotron Nano 9B V2: mixed architecture, open data, and long-context throughput (X, Blog, HF, Dataset, Try It)

NVIDIA shipped a fully open release of Nemotron Nano 9B V2—base, base-before-alignment/pruning, and a realigned reasoning model—and, crucially, they published most of the pretraining dataset details (~6.6T tokens across premium web, math, code, and SFT). That level of data transparency is rare and makes this a great base for fine-tuners who want reproducibility.

Under the hood, this is a mixed Mamba+Transformer architecture. NVIDIA is claiming up to 6x higher throughput versus a pure-Transformer peer (they compare to Qwen3-8B) and specifically highlight that they pruned a 12B down to 9B while preserving quality. They also note a single A10 can handle 128K context after compression and distillation passes, which is the kind of practical systems work that matters when you’re running fleets.
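
For the long-context claim, a serving sketch with vLLM would look roughly like the following. The repo id and the `trust_remote_code` flag are assumptions on my part; check the model card for the exact name and whatever the hybrid Mamba+Transformer stack actually requires.

```python
# Sketch: serving the 9B at 128K context with vLLM. The repo id is a best guess
# from the release; verify it and any extra flags the hybrid architecture needs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",  # assumed repo id, check the card
    max_model_len=131072,        # the advertised 128K window
    trust_remote_code=True,      # hybrid architectures often need this
)
params = SamplingParams(max_tokens=512, temperature=0.2)
print(llm.generate(["Summarize the attached design doc:\n..."], params)[0].outputs[0].text)
```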

A couple of caveats. The license is NVIDIA Open Model License—not Apache-2.0—so read it; it includes restrictions around illegal surveillance and safety bypasses and has revocation clauses. Personally, I appreciate the data openness and the long-context engineering; as always, just make sure the license fits your use case.

If you’re into longer-context math/coding with small models, the numbers (AIME’25, RULER-128K, GPQA) are impressive for 9B. And if you fine-tune: the availability of both pruned and pre-pruned bases plus the dataset recipe is a rare treat.

Cohere’s Command-A Reasoning: dense, multilingual, and research-only licensing (X, Blog, HF)

Cohere dropped a new reasoning model focused on enterprise deployment patterns. It’s a dense 111B model, supports a 256K context, and includes very strong multilingual coverage (23 languages is what they called out). What caught my eye: on the BFCL (Berkeley Function-Calling Leaderboard) they show 70%—above DeepSeek R1’s ~63% and GPT-OSS’s ~61%—and they plot the now-familiar test-time compute curves where more thinking tokens yield higher scores.

This release uses Cohere’s non-commercial research license; if you want commercial usage you’ll need to go through them. That said, for teams who need privately deployable, on-prem reasoning and can work under a research license for prototyping, it’s a serious option. A meta observation from the show: there’s accumulating evidence that more active parameters help multi-hop tool-use chains compared to very sparse MoE at similar effective capacity. This model nudges in that direction.

Desktop agents leap: ComputerRL hits 48% on OSWorld (Paper)

A new framework dubbed ComputerRL, from Z.ai and folks at Tsinghua University, unifies API calls with GUI actions and scales RL across fleets of virtual desktops, posting a 48.1% success rate on OSWorld versus ~12% for earlier open models. The training system spins up thousands of qemu-in-docker VMs via gRPC; the learning loop alternates RL with supervised fine-tuning and uses a clean step-level binary reward to simplify credit assignment. If you care about practical desktop automation across Ubuntu/Windows/macOS, this is a big jump.
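
For a feel of the recipe, here’s a schematic Python sketch of that alternating loop with a step-level binary reward. Every name here (the agent, the rollout, the update calls) is hypothetical; this paraphrases the paper’s description rather than reproducing the authors’ code.

```python
# Schematic sketch (not the authors' code) of ComputerRL's training recipe:
# rollouts over a fleet of virtual desktops, a step-level binary reward, and
# alternating RL updates with supervised fine-tuning on successful episodes.
from dataclasses import dataclass, field

@dataclass
class Step:
    action: str      # a GUI action or an API call, unified in one action space
    ok: bool         # did the step succeed?

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)
    rewards: list = field(default_factory=list)

def binary_step_reward(step: Step) -> float:
    # Clean credit assignment: 1.0 for a completed step, 0.0 otherwise.
    return 1.0 if step.ok else 0.0

def train(agent, desktop_pool, rounds: int = 10):
    for _ in range(rounds):
        # Phase 1: RL rollouts across the VM fleet (qemu-in-docker, launched via gRPC).
        trajectories = []
        for env in desktop_pool:
            traj: Trajectory = agent.rollout(env)
            traj.rewards = [binary_step_reward(s) for s in traj.steps]
            trajectories.append(traj)
        agent.rl_update(trajectories)

        # Phase 2: supervised fine-tuning on fully successful episodes to
        # stabilize the policy before the next RL phase.
        agent.sft_update([t for t in trajectories if all(s.ok for s in t.steps)])
```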

IBM + NASA’s Surya: open model for solar weather (HF)

Scientists get some love: IBM and NASA open-sourced Surya, a transformer trained on nine years of multi-instrument observations (nearly 200 TB) to forecast solar dynamics and space weather—the stuff that can knock satellites and power grids sideways. It’s on Hugging Face, it’s actually runnable, and it’s a fantastic example of open models delivering real-world scientific utility.

Smaller but notable: InternLM and OpenCUA, plus Intel’s quants

Two quick flags from the “worth your time” pile. InternLM shipped S1 Mini, an 8B vision+language model (ViT on top) that’s multimodal and lightweight; if you need on-device omni-ish behavior on a laptop or tablet, give it a look. And OpenCUA 32B (Qwen-based) is a specialized computer-usage agent with strong scores; if you’re building automations that need native OS control, it’s worth benchmarking.

Also, if you’re running 4-bit: the Intel quantization work is excellent right now. Their 4-bit quants have been extremely high precision in my testing, especially for large coders and reasoners like DeepSeek V3.1. It’s an easy win if you’re trying to squeeze a 30B+ onto a workstation without hemorrhaging quality.

Big-co updates and platform shifts

Sonic appears in Cursor and Cline

If you open Cursor or fire up Cline, you may see a new “Sonic” model toggle. It’s labeled as a reasoning model and, when you poke the function-calling internals, the call paths include “xai/…” strings. Folks report it’s fast and solid for coding. No official docs yet, but I’d be surprised if this isn’t Grok Code in pre-release clothes.

Agents.md: one file to rule your agents

Agentic dev stacks have multiplied config files like gremlins: Cursor’s rules.json, Windsurf’s prompts, MCP server manifests, tool schemas, install scripts… and every tool wants a different filename and format. OpenAI’s Agents.md is a strong attempt at standardization. It’s just Markdown at repo root that says, “here’s how to set up, build, test, and run this project,” plus any agent-specific caveats. Tools then auto-detect and follow your instructions instead of guessing.

It’s already supported by OpenAI Codex, Amp, Jules, Cursor, RooCode, and more, with tens of thousands of public repos adopting the pattern. In monorepos, the nearest Agents.md wins, so you can override at the package level. And human chat instructions still override the file’s guidance, which is the right default.
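
The “nearest file wins” behavior is easy to picture in code. Here’s a small sketch of how a tool might resolve it; this is an illustration of the convention (the file is conventionally named AGENTS.md), not any specific tool’s implementation.

```python
# Sketch of the "nearest AGENTS.md wins" rule: walk up from the file being
# edited and use the first AGENTS.md found, so package-level files override
# the repo-root one. Illustrative only, not any particular tool's code.
from pathlib import Path

def nearest_agents_md(start: Path, repo_root: Path) -> Path | None:
    current = start if start.is_dir() else start.parent
    while True:
        candidate = current / "AGENTS.md"
        if candidate.exists():
            return candidate
        if current == repo_root:
            return None
        current = current.parent

# Example: an agent editing packages/api/src/server.ts would pick up
# packages/api/AGENTS.md if it exists, otherwise the repo-root AGENTS.md.
print(nearest_agents_md(Path("packages/api/src/server.ts"), Path(".")))
```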

GPT‑5 context truncation in the web UI (reports)

There’s been a wave of reports that GPT‑5 in the web UI is truncating long prompts even when you’re under the documented context limit. The folks at Repo Prompt reproduced this multiple times and got confirmation from OpenAI that it’s a bug (not a deliberate nerf). If you saw GPT‑5 suddenly forget the bottom half of your carefully structured system prompt in the web app, this likely explains it. The API doesn’t seem affected. Fingers crossed for a quick fix—GPT‑5 is still the best model I’ve used for 300k‑token “read the whole repo and propose a plan” tasks.

Image and 3D: Nano Banana and Qwen’s open image editor

Nano Banana: 3D-consistent scene editing from thin air

A stealth model nicknamed “Nano Banana” surfaced in a web demo and started doing the kind of edits you’d normally expect from a 3D suite with a modeler at the controls. Take two photos—your living room and a product shot—and it composites the object into the scene with shockingly consistent lighting and geometry. Ask for a 3D mesh “five inches off the skin,” and the mesh really does offset. Ask for a new camera angle on a single still, and it renders the scene from above with plausible structure. People have been calling it a game-changer and, for once, it doesn’t feel like hyperbole.

There’s a strong whiff of 3D world modeling under the hood—some volumetric representation or neural field that enables true view synthesis—and Logan Kilpatrick posted a banana emoji that set speculation on fire. We’ll see where it lands and under what license, but for now the demo has been doing the rounds and it is… wow.

If you’re wondering where to try it: LMArena is currently the only way to give it a try, but it’s supposedly dropping soon!

Qwen Image Edit (20B): fully open and already practical (X, HF)

Qwen shipped a 20B image-editing model layered on their existing vision stack, and it’s properly open (permissive license) with strong bilingual text editing in Chinese and English. It handles high-level semantic edits (pose adjustments, rotations, style/IP creation) and low-level touch-ups (add/remove/insert). You can swap objects, expand aspect ratios, keep character identity consistent across panels, and do clean style transfer. It runs locally if you’ve got a decent GPU.
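
Here’s roughly what a local run looks like, following the pattern on the model card; treat the pipeline class and repo id as things to verify against your diffusers version rather than gospel.

```python
# Minimal local sketch of Qwen Image Edit. The pipeline class and repo id
# (QwenImageEditPipeline / "Qwen/Qwen-Image-Edit") should be double-checked
# against the current diffusers release and the model card.
import torch
from diffusers import QwenImageEditPipeline
from PIL import Image

pipe = QwenImageEditPipeline.from_pretrained("Qwen/Qwen-Image-Edit", torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = Image.open("product_shot.png").convert("RGB")  # your input photo
edited = pipe(
    image=image,
    prompt="Place this book in the model's right hand, matching the room lighting",
    num_inference_steps=50,
).images[0]
edited.save("edited.png")
```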

What I appreciate here is the precision. Product placement tasks like “put this book in this person’s hand at this angle,” or “make the shoes match the dress” come out with the kind of control that used to require hand masking and a dozen passes. And yes, the capybara mascot is back in the release materials, which made my day! 👏

If Nano Banana is the closed-world demo of what’s “beyond SOTA,” Qwen Image Edit is the open tool you can actually ship with today.

This week’s buzz from Weights & Biases

Two quick updates from our side. First, we’re working to bring DeepSeek V3.1 to our inference service as fast as we can so you can run serious benchmarks without fussing over serving stacks. Keep an eye on our channels; we’ll shout when it’s live and we’ll have some credits for early feedback.
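
Once it’s live, calling it should look like any OpenAI-compatible endpoint; the base URL and model id below are placeholders until the docs at wandb.me/inference confirm the real values.

```python
# Placeholder sketch of calling DeepSeek V3.1 once it lands on W&B Inference.
# The base_url and model id are assumptions; check wandb.me/inference for the
# real values when it goes live.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inference.wandb.ai/v1",  # assumed endpoint
    api_key="YOUR_WANDB_API_KEY",
)
resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",  # assumed model id
    messages=[{"role": "user", "content": "Give me a three-step plan to refactor this repo."}],
)
print(resp.choices[0].message.content)
```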

Second, our cofounder Chris Van Pelt released Catnip, a containerized multi‑agent coding workspace that runs multiple Claude Code sessions (or other agents) in isolated sandboxes, each with its own context and notification stream. If you’ve been juggling parallel coding agents that step on each other’s toes, this is catnip indeed.

Catnip Github: https://github.com/wandb/catnip

Closing thoughts

A year ago, “thinking tokens” weren’t even a curiosity; we only got the first whiff of “reasoning” back in September, and now we’re watching hybrid models that do more with less thinking, tool calls woven inside the reasoning loop, and long-context training regimes scaled up by an order of magnitude. The agent stack is maturing fast—desktop RL is finally clearing real tasks; editor ecosystems are converging on a single config file; and even the stealth drops are clearly building toward world-model-aware editing and control.

If you only try two things this week: run DeepSeek V3.1 in both modes (planning with thinking on, execution with thinking off) and throw a complex multi-step tool workflow at it; then take Qwen Image Edit for a spin on a real product-placement or character-consistency task. You’ll feel the future in your hands.

I’m off to the desert next week for a bit (no internet where I’m going), but Wolfram and the crew will keep the ship on course. If you’re at Burning Man, DM me—would love to say hi out there. As always, thank you for tuning in and nerding out with us every week.

— Alex


TL;DR and show notes

ThursdAI - Aug 21, 2025 - TL;DR


TL;DR of all topics covered:
