Interconnects · 12 hours ago
AI Model Ecosystem Update: Evaluation Discrepancies, New Models, and Development Trends
This update focuses on the latest developments in the AI model ecosystem. First, the CAISI report's evaluation of DeepSeek 3.1 diverges from community consensus, especially on SWE-bench, where limitations of the evaluation harness understated the model's capabilities. The report also compares HuggingFace model download counts, noting that different counting methodologies (such as excluding particular model types, quantized versions, and outliers) produce markedly different figures. The article then highlights the improved practicality of the GPT-OSS models and their strong download numbers, and surveys a series of notable recent releases, including IBM's granite-4.0, Qwen3-VL-235B, GLM-4.6, Ling-1T, and moondream3-preview, spanning different parameter scales, hybrid attention mechanisms, multimodal capabilities, and application-specific innovations.

🔍 **Evaluation challenges and methodological differences**: The CAISI report's evaluation of DeepSeek 3.1 diverges from community consensus, especially on the SWE-bench benchmark, where limitations of the evaluation harness distorted the picture of the model's capabilities. In addition, organizations count model downloads with different filtering criteria (release date, model type, quantized versions, outlier handling), producing significantly different figures and underscoring the importance of data cleaning and statistical methodology.

🚀 **GPT-OSS utility gains and community feedback**: Since release, the GPT-OSS models have overcome early implementation difficulties; the 20B and 120B versions in particular have posted strong recent download numbers, and community feedback reports performance exceeding some popular models. The series is proving reliable and capable in real-world use.

💡 **Recent open-model highlights**: The article covers several notable recent releases, including IBM's Granite 4.0 (a hybrid-attention model with a measured tone), updates to the Qwen3-VL series (multimodal; the 8B version is worth watching), GLM-4.6 (performance approaching closed models), Ling-1T (very large scale, diverse architecture), and Moondream3-Preview (MoE architecture with an unusual commercial license). These models differ in performance, architecture, and target use cases, representing the current frontier of open AI model development.

📈 **Rapid progress of Chinese open models**: The article repeatedly notes the fast pace of Chinese AI labs such as Qwen, Zhipu (GLM), and Meituan (LongCat), whose models now rival top closed models, particularly in long-context handling and multimodal capability, showing strong momentum in closing the gap.

Before getting into the latest artifacts, there are a couple of crucial pieces of open-ecosystem news we have to cover.

First, the Center for AI Standards and Innovation (CAISI) released a report that observed the ecosystem and evaluated DeepSeek 3.1 against leading closed models. The evaluation scores they highlighted show some discrepancy with accepted results in the community. While MMLU-Pro, GPQA, and HLE are close to the self-reported scores from DeepSeek and within usual error bars, the SWE-bench Verified scores are off by a wide margin due to a weak harness for the benchmark. The harness is the software framework the model runs in for agentic benchmarks, and it has as great an impact as the model itself, as shown in this SWE-bench analysis by Epoch AI.

The CAISI report thus undersells the capabilities of DeepSeek’s models on a core benchmark for recent models (e.g. it is one of the benchmarks that Anthropic most heavily relies on for marketing of Claude).

Later in the report, CAISI shows a graph with cumulative download numbers from HuggingFace (left), something we also show on atomproject.ai (middle, right). However, our numbers differ greatly from CAISI's, and those differ even more from the ones published by HuggingFace itself. So, what is going on?

In short, it depends on which data you look at and how you clean it. For the ATOM Project, we only consider models which were released after ChatGPT and are LLMs (based on our assessment). This excludes models like GPT-2 (which is the reason why OpenAI trumps all in the CAISI numbers, left), BERT-like models, and ViTs like SigLIP (which dominate the Google download numbers).
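The inclusion rule just described can be sketched as a simple predicate (a hypothetical illustration; the field names and exact cutoff date are assumptions, not the ATOM Project's actual code):

```python
from datetime import date

# Cutoff: only models released after ChatGPT (launched 2022-11-30) are counted.
CHATGPT_RELEASE = date(2022, 11, 30)

def include_model(released: date, is_llm: bool) -> bool:
    """Mirror the filtering described in the text: post-ChatGPT LLMs only.

    This rule drops GPT-2, BERT-style encoders, and ViTs such as SigLIP.
    """
    return is_llm and released > CHATGPT_RELEASE

# GPT-2 (2019) predates ChatGPT, so it is excluded from the totals.
assert not include_model(date(2019, 2, 14), is_llm=True)
# SigLIP is a vision model, not an LLM, so it is excluded regardless of date.
assert not include_model(date(2023, 3, 27), is_llm=False)
```

Applying (or not applying) even a rule this simple is enough to reshuffle which organization "leads" the download charts.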

On top of this, we performed basic outlier filtering on daily downloads per model. Many models, such as Qwen 2.5 1.5B, one of the most downloaded models of all time, have extreme outliers on the order of 10M+ downloads that can heavily skew the overall numbers. These outliers affect every organization, but to different magnitudes. We also exclude quantized versions (like FP8, MLX, or GGUF), as those might skew the numbers.
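The two cleaning steps above, dropping extreme daily spikes and excluding quantized repos, can be sketched roughly as follows (a hypothetical illustration with made-up numbers; the median-based threshold and the name-matching heuristic are assumptions, not the ATOM Project's actual pipeline):

```python
from statistics import median

# Quantized variants excluded from totals, per the text.
QUANT_MARKERS = ("fp8", "mlx", "gguf")

def is_quantized(repo_id: str) -> bool:
    """Heuristic: treat repos whose name mentions a quantization format as quantized."""
    name = repo_id.lower()
    return any(marker in name for marker in QUANT_MARKERS)

def filter_outliers(daily_downloads: list[int], k: float = 10.0) -> list[int]:
    """Drop daily counts more than k times the median (one simple outlier rule).

    A single 10M+ day for a model like Qwen 2.5 1.5B would otherwise
    dominate its cumulative total.
    """
    m = median(daily_downloads)
    return [d for d in daily_downloads if d <= k * m]

daily = [40_000, 55_000, 48_000, 12_000_000, 51_000]  # one extreme spike
clean = filter_outliers(daily)
assert 12_000_000 not in clean  # the spike is removed

assert is_quantized("Qwen/Qwen2.5-1.5B-Instruct-GGUF")
assert not is_quantized("Qwen/Qwen2.5-1.5B-Instruct")
```

Different choices of threshold (or a different outlier rule entirely) will yield different cumulative totals, which is exactly why the CAISI, ATOM, and HuggingFace figures disagree.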


The second news item is an update on the utility of GPT-OSS. When the model first dropped it was plagued by implementation difficulties downstream of architecture choices (e.g. a new 4-bit precision format) and complex tool use (multiple tool options per category). OpenAI is actually ahead of the curve among open options on the complexity of tools they support with these models. Since release, use of GPT-OSS's 20B and 120B models has been very strong, with 5.6M and 3.2M downloads in the last month, respectively. These models are outperforming some popular models, such as Qwen 3 4B or Qwen3-VL-30B-A3B-Instruct. Additionally, I got very strong feedback from the community when I did a basic pulse check on the models. These are among the first models I'd try on my new Nvidia DGX-Spark to get a feel for things.

Artifacts Log

Our Picks


In the rest of the issue we highlight the long tail of models, which again shows the sweeping approach we've seen throughout the year from Qwen, with continuing support from other rising Chinese labs. One of the sad things in this issue is that zero datasets cleared our bar of relevance. Open data continues to be in a very precarious position.

Models

Flagship

Qwen3-Next, or rather, a preview of our next generation (3.5?), is out!

This time we are trying to be bold, but in fact we have been running experiments on hybrid models and linear attention for about a year. We believe our solution is at least a stable, solid approach to a new model architecture for super-long context!

