ThursdAI - Recaps of the most high signal AI weekly spaces
Progress in Open-Source AI and AI Tools

This week, open-source AI was led by Chinese labs, which released several important models, including Kimi-Dev-72B and MiniMax-M1. Meanwhile, Google graduated Gemini 2.5 Flash and Pro from preview to general availability and introduced a new Flash-Lite model. OpenAI added MCP support and launched a meeting-recording feature. Weights & Biases rolled out a new inference service and online evaluation tools, while Yupp.ai launched a free platform with 500+ models where users earn credits by rating model outputs. In video, ByteDance's Seedance 1.0 and MiniMax's Hailuo 02 posted standout results on text-to-video and image-to-video tasks, and Midjourney released V1 of its video model. On the voice side, Kyutai released real-time speech-to-text models. In research, LiveCodeBench Pro revealed how far top LLMs still fall short on real-world coding problems, while an MIT study suggested that frequent ChatGPT use may erode cognitive ability.

🌟 Kimi-Dev-72B, open-sourced by Moonshot AI and optimized for software engineering, hits 60.4% on SWE-bench Verified, making it the new open-source champion on coding benchmarks. It was trained by patching real GitHub repos inside Docker and receiving a reward only when the full test suite passes, but it is weaker outside its comfort zone (e.g., writing or general Q&A).

📈 MiniMax-M1 is a 1M-context-window, 456B-parameter reasoning giant with a hybrid mixture-of-experts architecture and lightning attention. It scores 56.0% on SWE-bench Verified, 86.0% on AIME 2024 (math), 68.3% on FullStackBench (coding), and 73.4 on OpenAI's MRCR at both 128k and the 1M version, on par with the closed Gemini models.

🔗 Google graduated Gemini 2.5 Flash and Pro from preview to general availability and introduced the new Flash-Lite model, which shines in home-automation systems at just 10 cents per million input tokens and 400+ tokens per second, making it well suited to high-volume, low-latency tasks.

📊 OpenAI added MCP support to Deep Research in ChatGPT. Although limited to read-only operations, it lets anyone create new tools that connect company data or third-party services, signaling growing MCP adoption.

🎙️ OpenAI launched a meeting-recording feature that transcribed a 2-hour show flawlessly, identified speakers by name, created timestamped chapter summaries, and generated action items for listeners, showing off strong speech recognition and summarization.

🚀 Weights & Biases launched a new inference service and online evaluation tools, offering a unified interface for running top open-source models and real-time monitoring of LLM performance on production calls.

🏆 Yupp.ai offers 500+ models for free, including GPT-4.5 and Claude Opus 4; users earn credits by rating model outputs, with detailed leaderboards and a "vibe score" system.

🎬 ByteDance's Seedance 1.0 beat Google's Veo 3 on text-to-video and image-to-video tasks, while MiniMax's Hailuo 02 delivers native 1080p and an instruction parser that actually obeys, producing crisper visuals.

🖌️ Midjourney released V1 of its video model, turning any image into four five-second videos that keep its distinctive "MJ aesthetic", though at just 480p.

🗣️ Kyutai released real-time speech-to-text models: a 2.6B-parameter English model that beats Whisper Large v3 on benchmarks, plus a low-latency English/French version suited to real-time applications.

Hey all, Alex here 👋

This week, while not the busiest in releases (we can't get a SOTA LLM every week now, can we?), was full of interesting open-source releases and feature updates, such as the ChatGPT meeting recorder (which we live-tested on the show; the limit is 2 hours!)

It was also the day after our annual W&B conference, Fully Connected, so I had a few goodies to share with you, like answering the big question: when will W&B put those CoreWeave GPUs to use? The answer is... now! (We launched a brand-new preview of an inference service with open-source models.)

And finally, we had a great chat with Pankaj Gupta, co-founder and CEO of Yupp, a new service that lets users chat with the top AIs for free while turning their votes into leaderboards, so everyone else can see which GenAI model is best for which task or topic. It was a great conversation, and he even shared an invite code with all of us (I'll attach it to the TL;DR and show notes). Let's dive in!

Open Source AI: China at the Helm, Benchmarks Reborn

Just a year ago, a 16K context window was astonishing. Now? The big action is one million tokens and “stochastic people-spirits” that claim to rival world-class coders. Here are the highlights:

Moonshot AI Kimi-Dev-72B: Coding’s New Open-Source Champion? (HF | GitHub)

Meet Kimi-Dev-72B: an open-weights LLM optimized specifically for software engineering, and it isn’t shy: 60.4% on SWE-bench Verified, making it the new open-source SOTA on coding benchmarks.

The trick? Kimi doesn’t just spit out answers; it's trained to patch actual GitHub repos inside Docker, and it gets a reward only when the full test suite passes. That’s serious RL polish. But, and here’s the rub, outside its comfort zone (i.e., writing or general Q&A) it’s not as strong. The moment you step outside SWE-bench, expect a drop-off. Several in the AllHands community have already prodded at the “fixed workflow” design and found inconsistencies versus real-world tooling, especially for broader agentic tasks.
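
To make that training signal concrete, here's a minimal sketch of the "reward only on a fully green test suite" idea. To be clear, this is my illustration, not Moonshot's pipeline; the Docker image, repo layout, and pytest command are all assumptions:

```python
# Conceptual sketch of a pass@test-suite reward, NOT Moonshot's actual training code.
# The image name, repo layout, and test command below are illustrative assumptions.
import subprocess

def patch_reward(repo_dir: str, patch_file: str, image: str = "python:3.11") -> float:
    """Apply a model-generated patch inside Docker; reward only a fully green test suite."""
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{repo_dir}:/repo",
        "-v", f"{patch_file}:/patch.diff",
        image,
        "bash", "-c",
        "cd /repo && git apply /patch.diff && pip install -e . -q && pytest -q",
    ]
    result = subprocess.run(cmd, capture_output=True)
    # Binary reward: 1.0 only if *every* test passes, 0.0 otherwise.
    return 1.0 if result.returncode == 0 else 0.0
```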

So, huge respect for pushing the coding benchmark ceiling, but as ever: vibes ≠ the pure leaderboard, so test for your use case, not just the tweetable score.

MiniMax-M1: The New 1M-Context, 456B-Param Reasoning Titan (paper | HF | Demo)

If there’s one model everyone buzzed about, it’s MiniMax-M1. In a field packed with megamodels, M1 stands out not just for its 456 billion total parameters (45B active per forward pass), but for:

- A hybrid mixture-of-experts architecture with lightning attention
- A full 1M-token context window

Benchmarks:

- 56.0% on SWE-bench Verified
- 86.0% on AIME 2024 (math)
- 68.3% on FullStackBench (coding)
- 73.4 on OpenAI's MRCR at 128k and on the 1M version, on par with the closed Gemini models

Though, numbers aside, I've tried it (leaning on the impressive long context) and got some... interesting results. The thinking part was great, but it wasn't good at following my instructions or my writing style; in fact, it didn't adhere to my style at all. But hey, different models serve different purposes, right?

Open Source: The China Shift

If you’re following the pattern, you’ll notice that the open-source frontier is almost entirely led by Chinese labs: Qwen, DeepSeek, MiniMax, Moonshot. Restrictions are real, but progress is relentless. If you’re building on top of OSS AI, it’s time to get comfortable reading Hugging Face and GitHub docs in translation.

Big Companies, APIs & LLMs

Gemini 2.5: From Preview to Production

Google made moves this week, graduating Gemini 2.5 Flash and Pro from preview to general availability. But the real star is the new Flash-Lite model in preview.

Wolfram immediately put Flash-Lite into production for his home automation system and couldn't stop raving about it: "The latency has been reduced when I give a command and the house reacts... even its German is very good!"

At just 10 cents per million input tokens and running at 400+ tokens per second, this tiny reasoning model is finding its niche. Sure, it's not going to win any benchmarks (66% on GPQA Diamond), but for high-volume, low-latency tasks? This is exactly what we needed.
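
If you want to kick the tires yourself, here's a minimal sketch using the google-genai Python SDK. The exact model id for the preview is an assumption on my part, so check Google's docs for the current name:

```python
# Minimal sketch: calling the Flash-Lite preview via the google-genai SDK (pip install google-genai).
# The model id below is an assumption; confirm the current preview name in Google's docs.
from google import genai

client = genai.Client(api_key="YOUR_GEMINI_API_KEY")

response = client.models.generate_content(
    model="gemini-2.5-flash-lite",  # assumed id for the preview model
    contents="Turn off the living room lights and set the thermostat to 20C.",
)
print(response.text)
```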

OpenAI: MCP Support and That Meeting Recorder Though...

OpenAI quietly added MCP (Model Context Protocol) support to Deep Research inside ChatGPT, though it's limited to read-only operations with search and fetch tools. What this means is that anyone (including companies) can create a new tool for Deep Research to connect to, whether it's your company data or a third-party service you want Deep Research to use.

Though limited to Deep Research only, this shows incredible MCP adoption, and will likely make its way to standard ChatGPT soon.
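
To make "read-only with search and fetch tools" concrete, here's a minimal sketch of what such an MCP server could look like with the official Python mcp SDK. The in-memory document store and tool bodies are placeholder assumptions, not anything OpenAI ships:

```python
# Minimal sketch of a read-only MCP server exposing search/fetch tools,
# using the official `mcp` Python SDK (pip install mcp).
# The in-memory "database" is a placeholder assumption.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("company-docs")

DOCS = {"doc-1": "Q2 revenue grew 14% quarter over quarter."}

@mcp.tool()
def search(query: str) -> list[str]:
    """Return ids of documents whose text mentions the query."""
    return [doc_id for doc_id, text in DOCS.items() if query.lower() in text.lower()]

@mcp.tool()
def fetch(doc_id: str) -> str:
    """Return the full text of a single document by id."""
    return DOCS.get(doc_id, "not found")

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```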

But the real surprise came when we tested their new meeting recorder feature live during the show. I mean... wow. Not only did it transcribe our entire 2-hour episode perfectly, but it also:

- Identified speakers by name
- Created timestamped chapter summaries
- Generated action items for listeners

Yam's reaction said it all: "That's crazy useful! Seriously, that's crazy!"

The model diarized (speaker identification) and pulled the names from context, which was really awesome. So when I introduced myself as Alex Volkov during the show, or my guest Pankaj did the same, it picked up those names and applied them to the "speaker 01" and "speaker 02" labels. It did mess up though, attributing Nisten's name to Pankaj's voice, leading to some funny hallucinations in the summary, like making Nisten the CEO of Yupp. 😅 But hey, it's a research preview!

Zuck is on a spending spree:

Lastly, a quick Meta update—Zuck’s on a spending spree, and word is he’s eyeing Nat Friedman and Daniel Gross, powerhouse investors with a killer AI portfolio, to join Meta’s superintelligence push. After last week’s Scale AI move, this signals big ambitions. I’m watching this space closely!


🐝 This Week’s Buzz: Weights & Biases Drops Game-Changing Tools at Fully Connected

Alright, time for some home team news from Weights & Biases Weave, and I’m pumped to share what went down at our Fully Connected conference. We hosted a great event with way more people than we had room for—and unveiled some seriously cool stuff since joining forces with CoreWeave. Two big launches stand out, and I’m itching to see how you’ll use them.

First, W&B Inference is here in preview, giving you a unified interface to run top open-source models like DeepSeek R1-0528 and Phi 4 Mini 3.8B directly from our platform, backed by CoreWeave GPUs. Pricing is competitive, and even free-tier users get credits to play around. Whether through our playground for side-by-side comparisons or an OpenAI-compatible API, it’s seamless. I’m all about making powerful AI accessible, and this is a step in that direction.
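
Since the API is OpenAI-compatible, the standard openai client should work pointed at our endpoint. A minimal sketch follows; the base URL and model slug below are from memory, so double-check the W&B Inference docs for the exact values:

```python
# Sketch: W&B Inference is OpenAI-compatible, so the standard openai client works.
# The base_url and model slug are assumptions; check the W&B docs for the real values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inference.wandb.ai/v1",  # assumed endpoint
    api_key="YOUR_WANDB_API_KEY",
)

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-0528",  # assumed slug for DeepSeek R1-0528
    messages=[{"role": "user", "content": "Summarize this week's open-source AI news."}],
)
print(resp.choices[0].message.content)
```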

Second, for my evaluation nerds out there (you know I’m one of them!), W&B Weave now supports Online Evaluations. This means real-time monitoring of production calls with an LLM as a judge—set it up in minutes to track performance drops on a subset of your traces. You can use our inference service or plug in your own keys from OpenAI, Anthropic, or others. It’s a game-changer for deploying AI reliably, and I walked through the setup live on the show. Both tools tie into our “metal-to-token” observability vision, tracking everything from hardware to application metrics. Try them out and hit me up with feedback—I’m all ears!
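
The actual setup happens in the Weave UI, but to give you a feel for the underlying LLM-as-a-judge pattern, here's a minimal sketch using Weave's tracing ops. The judge prompt and the 1-5 scoring scheme are my own illustrative choices, not product built-ins:

```python
# Sketch of the LLM-as-a-judge pattern behind online evaluations.
# weave.init / @weave.op are real Weave tracing APIs; the judge prompt
# and score parsing below are illustrative assumptions.
import weave
from openai import OpenAI

weave.init("my-production-app")
client = OpenAI()

@weave.op()
def judge(question: str, answer: str) -> int:
    """Score a production answer 1-5 with an LLM judge; the trace lands in Weave."""
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Rate this answer 1-5 (reply with a single digit).\n"
                       f"Q: {question}\nA: {answer}",
        }],
    )
    return int(out.choices[0].message.content.strip()[0])

print(judge("What is 2+2?", "4"))
```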


Evaluation Revolution: Enter Yupp.ai

Speaking of evaluations, we had Pankaj Gupta, co-founder and CEO of Yupp.ai, join us to discuss their fresh take on model evaluation.

What makes Yupp different:

- 500+ models available for free, including GPT-4.5 and Claude Opus 4
- Users earn credits by rating model outputs (and can get paid)
- Detailed leaderboards and a "vibe score" system

Because this is a real product (and a VERY polished one), and users are incentivized to build out their profiles, not to mention getting paid, Yupp can analyze preferences by users' age, profession, and location. For example, this is the first place I know of that can tell me which model young users prefer (currently Gemini Flash Lite!).

The platform launched less than a week ago and already has 400,000+ conversations for evaluation, and we'll definitely be taking a look, as its angel investors include folks like Jeff Dean from DeepMind, Josh Woodward, and many other AI minds!

Arenas and Vibes?

Just as I'm writing this, The Information disclosed that while LMArena claims to be impartial, more and more companies are asking Surge (a data-labeling provider that competes with Scale) for data to optimize for LMArena. This, on top of the earlier Cohere paper about LMArena, our general differences in vibes, and their recent $100M funding round, makes me think we need a few new places to get vibes on LLMs. Personally, I'm hoping Yupp can be that place.

If you'd like an invite and a few extra credits there, Pankaj gave all listeners of ThursdAI an invite ahead of the rest of the pack. Click here and enjoy rating models!*

Vision & Video: The Week 1080p Went Mainstream

Video models have been heating up, and oh boy, this week was a massive shakeup! Veo 3 went from first to... third, and a16z published their first AI video market map.

ByteDance Seedance 1.0 (and Mini)

TikTok’s parent finally flexed its video-data moat. Seedance 1.0 shot straight to #1 on the Artificial Analysis arena, beating Google Veo 3 in both text-to-video and image-to-video. It still outputs 768p, but frame-by-frame sharpness and prompt coherence are ridiculous. While you can't use it directly today, you can test-drive its smaller sibling (Seedance 1.0-mini) on Dreamina or Fal (they're calling it lite).
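
If you want to try the lite model programmatically, Fal's Python client can queue a job in a few lines. A minimal sketch, with the caveat that the app id and response shape below are assumptions; check fal.ai's model gallery for the real ones:

```python
# Sketch: generating a clip with Seedance lite via fal_client (pip install fal-client).
# The app id and response shape are assumptions; see fal.ai's model gallery.
import fal_client

result = fal_client.subscribe(
    "fal-ai/bytedance/seedance/v1/lite/text-to-video",  # assumed app id
    arguments={"prompt": "a slow pan across a neon-lit ramen shop at night"},
)
print(result["video"]["url"])  # assumed response shape
```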

MiniMax Hailuo 02: Native 1080p, Physics-Aware Clips

Not to be outdone, the same MiniMax team behind the M1 LLM pushed out Hailuo 02: native 1080p, 6-second clips, and an instruction parser that finally obeys camera moves (“slow pan left across a floating sushi conveyor” worked on the first try). Training tricks: Noise-aware Compute Redistribution (NCR) plus 4× the dataset of Hailuo 01. Early testers report 2.5× the inference throughput of Veo at a fraction of the cost.

Midjourney Video V1

Midjourney fans, you’re now animated. The first web-only video beta turns any image into four five-second videos. It’s 480p and costs 8× an image job, but the “MJ aesthetic” survives intact, which no one else matches yet. Just look at the crazy video someone posted on X; it keeps the collage effect very well!

Voice Tech Breakthroughs: Kyutai’s Real-Time Speech-to-Text (X, website)

On the audio front, Kyutai, a French company we’ve hyped before for Moshi, just launched two open-source Speech-to-Text models that are turning heads. They’ve got a 2.6B parameter English-only model that outperforms Whisper Large v3 on benchmarks, and a 1B parameter English/French version optimized for low-latency at just 500ms delay.

What’s cool is the real-time streaming: it transcribes as you speak, not after a buffer, with a semantic voice activity detector that knows when you’ve stopped. It can handle up to 400 concurrent streams on a single H100 GPU! I see this as a hidden gem for voice chat apps or live transcription. Give it a try yourself here.
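
I haven't wired their server into an app yet, so take this as a shape sketch only: the endpoint URL and message format below are placeholder assumptions, not Kyutai's actual protocol. The point is the pattern: stream audio chunks up while partial transcripts stream back down.

```python
# Hypothetical streaming-STT client shape: send audio chunks, print partial transcripts.
# The endpoint URL and message format are placeholder assumptions, NOT Kyutai's real protocol.
import asyncio
import websockets

async def stream_transcribe(pcm_chunks) -> None:
    async with websockets.connect("ws://localhost:8080/stt") as ws:  # assumed endpoint
        async def send_audio():
            for chunk in pcm_chunks:      # raw PCM frames, e.g. 80 ms each
                await ws.send(chunk)
            await ws.close()              # signal end of audio
        async def print_partials():
            async for msg in ws:          # server pushes partial transcripts as they firm up
                print("partial:", msg)
        await asyncio.gather(send_audio(), print_partials())

# usage: asyncio.run(stream_transcribe(my_audio_chunks))
```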

Research and Tools Shaking Things Up: From Cognitive Concerns to Crowd-Powered Evals

Wrapping up, let’s touch on some research and tools that caught my eye. A new benchmark, LiveCodeBench Pro, dropped a bombshell—top LLMs like o4-mini-high score 0% on hard real-world coding contest problems from Codeforces and IOI, lagging at 2,100 Elo compared to human grandmasters at 2,700. It’s a stark reminder that while AI shines on routine tasks, deep logic and creativity are still human territory. Dive into the paper for the gritty details.

On a more sobering note, an MIT study with just 54 participants suggests frequent ChatGPT use might cause cognitive decline, showing a 47% drop in neural connections and memory retention issues after four months. It’s a small sample, but it raises questions about over-reliance on AI—something to chew on as we integrate these tools deeper into our lives.

Andrej Karpathy’s latest talk on “Software 3.0” is pure inspiration, framing English as the new programming language with LLMs as operating systems. It’s a vision of “vibe coding” democratizing software for billions—definitely worth a watch for any AI dreamer out there.


That's it! Even for a relatively chill week, we've had an amazing show! Next week I'm on vacation, and Wolfram and the gang will lead the show. Don't have too much fun without me, yeah? 😎 I'll be tuning in as a listener, maybe.

Here are the TL;DR and show notes links.

ThursdAI - June 19th, 2025 - TL;DR

*Disclosure: while this isn't a paid promotion, I do think Yupp has great value. I get a few extra credits on their platform if you click my link, and so do you. You can go to yupp.ai and register with no affiliation if you wish.
